Objectives

  • Visualize data with clustering techniques, including multiple correspondence analysis, multivariate discriminant analysis, and comparison of dendrograms (visual segmentation of customer profiles).
  • Verify whether the clusters are aligned with the customers’ profiles (education level, marital status, filed a complaint, etc.).
  • Use step-wise techniques in regressions to predict the likelihood of positive customer response to the marketing campaign.

The Dataset

A superstore is planning for the year-end sale. It wants to launch a new offer: a gold membership that gives a 20% discount on all purchases for only $499, instead of the usual $999. The offer will be valid only for existing customers, and a phone-call campaign targeting them is currently being planned. Management feels that the best way to reduce the cost of the campaign is to build a predictive model that classifies customers who are likely to purchase the offer.

Source of data: https://www.kaggle.com/datasets/ahsan81/superstore-marketing-campaign-dataset

Variable Description
CUSTOMERS INFORMATION
Response (target) 1 if customer accepted the offer in the last campaign, 0 otherwise
ID Unique ID of each customer
Year_Birth Birth year of the customer
Dt_Customer Date of customer’s enrollment with the company
Education Customer’s level of education (Basic, Second Cycle, Graduation, Master, PhD)
Marital Customer’s marital status (Single, Together, Married, Divorced, Other)
Kidhome Number of small children in customer’s household (0, 1, 2)
Teenhome Number of teenagers in customer’s household (0, 1, 2)
Income Customer’s yearly household income
Complain 1 if the customer complained in the last 2 years
Recency Number of days since the last purchase
AMOUNT OF MONEY SPENT ON TYPES OF PRODUCTS
MntFishProducts The amount spent on fish products in the last 2 years
MntMeatProducts The amount spent on meat products in the last 2 years
MntFruits The amount spent on fruit products in the last 2 years
MntSweetProducts The amount spent on sweet products in the last 2 years
MntWines The amount spent on wine products in the last 2 years
MntGoldProds The amount spent on gold products in the last 2 years
NUMBER OF PURCHASES MADE THROUGH TYPES OF CHANNELS
NumDealsPurchases Number of purchases made with discount
NumCatalogPurchases Number of purchases made using catalog (buying goods to be shipped through the mail)
NumStorePurchases Number of purchases made directly in stores
NumWebPurchases Number of purchases made through the company’s website
NumWebVisitsMonth Number of visits to company’s website in the last month

For this project, we are assuming that the survey ended in December 2014 and we are doing our analyses in January 2015. This will be the basis for transforming some of our date-time variables later.

Exploratory Data Analysis

# overview of data attributes
str(super)
## 'data.frame':    2240 obs. of  22 variables:
##  $ Id                 : int  1826 1 10476 1386 5371 7348 4073 1991 4047 9477 ...
##  $ Year_Birth         : int  1970 1961 1958 1967 1989 1958 1954 1967 1954 1954 ...
##  $ Education          : chr  "Graduation" "Graduation" "Graduation" "Graduation" ...
##  $ Marital_Status     : chr  "Divorced" "Single" "Married" "Together" ...
##  $ Income             : int  84835 57091 67267 32474 21474 71691 63564 44931 65324 65324 ...
##  $ Kidhome            : int  0 0 0 1 1 0 0 0 0 0 ...
##  $ Teenhome           : int  0 0 1 1 0 0 0 1 1 1 ...
##  $ Dt_Customer        : chr  "6/16/2014" "6/15/2014" "5/13/2014" "11/5/2014" ...
##  $ Recency            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MntWines           : int  189 464 134 10 6 336 769 78 384 384 ...
##  $ MntFruits          : int  104 5 11 0 16 130 80 0 0 0 ...
##  $ MntMeatProducts    : int  379 64 59 1 24 411 252 11 102 102 ...
##  $ MntFishProducts    : int  111 7 15 0 11 240 15 0 21 21 ...
##  $ MntSweetProducts   : int  189 0 2 0 0 32 34 0 32 32 ...
##  $ MntGoldProds       : int  218 37 30 0 34 43 65 7 5 5 ...
##  $ NumDealsPurchases  : int  1 1 1 1 2 1 1 1 3 3 ...
##  $ NumWebPurchases    : int  4 7 3 1 3 4 10 2 6 6 ...
##  $ NumCatalogPurchases: int  4 3 2 0 1 7 10 1 2 2 ...
##  $ NumStorePurchases  : int  6 7 5 2 2 5 7 3 9 9 ...
##  $ NumWebVisitsMonth  : int  1 5 2 7 7 2 6 5 4 4 ...
##  $ Response           : int  1 1 0 0 1 1 1 0 0 0 ...
##  $ Complain           : int  0 0 0 0 0 0 0 0 0 0 ...

From a quick look at the dataset, we are not interested in the customers’ birth year or their date of enrollment with the company as such. However, the age profile of the customers might have some relationship with the customers’ response. We will therefore create a new column called Age from the Year_Birth column. We will also transform the date of enrollment into the number of days since the customer joined (counting up to 01/01/2015) in a new column called days_joined. Since Year_Birth and Dt_Customer then contain redundant information, we will remove them from our dataset.

# transform Year_Birth into Age
yr <- 2015
super["Age"] <- yr - super["Year_Birth"]
# set today as 01/01/2015
today <- as.Date("01/01/2015", format = "%m/%d/%Y")

# reformat the date in Dt_Customer into the YYYY-MM-DD format
super$Dt_formatted <- as.POSIXct(super$Dt_Customer, format="%m/%d/%Y")

# calculate the number of days since customers joined
super$days_joined <- as.numeric(difftime(today, super$Dt_formatted), unit="days")
# remove redundant columns (Year_Birth, Dt_Customer and Dt_formatted)
super <- subset(super, select = -c(Year_Birth, Dt_Customer, Dt_formatted))
# descriptive summary of the data
summary(super)
##        Id         Education         Marital_Status         Income      
##  Min.   :    0   Length:2240        Length:2240        Min.   :  1730  
##  1st Qu.: 2828   Class :character   Class :character   1st Qu.: 35303  
##  Median : 5458   Mode  :character   Mode  :character   Median : 51382  
##  Mean   : 5592                                         Mean   : 52247  
##  3rd Qu.: 8428                                         3rd Qu.: 68522  
##  Max.   :11191                                         Max.   :666666  
##                                                        NA's   :24      
##     Kidhome          Teenhome         Recency         MntWines      
##  Min.   :0.0000   Min.   :0.0000   Min.   : 0.00   Min.   :   0.00  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:24.00   1st Qu.:  23.75  
##  Median :0.0000   Median :0.0000   Median :49.00   Median : 173.50  
##  Mean   :0.4442   Mean   :0.5062   Mean   :49.11   Mean   : 303.94  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:74.00   3rd Qu.: 504.25  
##  Max.   :2.0000   Max.   :2.0000   Max.   :99.00   Max.   :1493.00  
##                                                                     
##    MntFruits     MntMeatProducts  MntFishProducts  MntSweetProducts
##  Min.   :  0.0   Min.   :   0.0   Min.   :  0.00   Min.   :  0.00  
##  1st Qu.:  1.0   1st Qu.:  16.0   1st Qu.:  3.00   1st Qu.:  1.00  
##  Median :  8.0   Median :  67.0   Median : 12.00   Median :  8.00  
##  Mean   : 26.3   Mean   : 166.9   Mean   : 37.53   Mean   : 27.06  
##  3rd Qu.: 33.0   3rd Qu.: 232.0   3rd Qu.: 50.00   3rd Qu.: 33.00  
##  Max.   :199.0   Max.   :1725.0   Max.   :259.00   Max.   :263.00  
##                                                                    
##   MntGoldProds    NumDealsPurchases NumWebPurchases  NumCatalogPurchases
##  Min.   :  0.00   Min.   : 0.000    Min.   : 0.000   Min.   : 0.000     
##  1st Qu.:  9.00   1st Qu.: 1.000    1st Qu.: 2.000   1st Qu.: 0.000     
##  Median : 24.00   Median : 2.000    Median : 4.000   Median : 2.000     
##  Mean   : 44.02   Mean   : 2.325    Mean   : 4.085   Mean   : 2.662     
##  3rd Qu.: 56.00   3rd Qu.: 3.000    3rd Qu.: 6.000   3rd Qu.: 4.000     
##  Max.   :362.00   Max.   :15.000    Max.   :27.000   Max.   :28.000     
##                                                                         
##  NumStorePurchases NumWebVisitsMonth    Response         Complain       
##  Min.   : 0.00     Min.   : 0.000    Min.   :0.0000   Min.   :0.000000  
##  1st Qu.: 3.00     1st Qu.: 3.000    1st Qu.:0.0000   1st Qu.:0.000000  
##  Median : 5.00     Median : 6.000    Median :0.0000   Median :0.000000  
##  Mean   : 5.79     Mean   : 5.317    Mean   :0.1491   Mean   :0.009375  
##  3rd Qu.: 8.00     3rd Qu.: 7.000    3rd Qu.:0.0000   3rd Qu.:0.000000  
##  Max.   :13.00     Max.   :20.000    Max.   :1.0000   Max.   :1.000000  
##                                                                         
##       Age          days_joined     
##  Min.   : 19.00   Min.   :  25.67  
##  1st Qu.: 38.00   1st Qu.: 366.42  
##  Median : 45.00   Median : 538.71  
##  Mean   : 46.19   Mean   : 537.74  
##  3rd Qu.: 56.00   3rd Qu.: 710.92  
##  Max.   :122.00   Max.   :1088.67  
## 

Data Cleaning

# visualize missing data
vis_miss(super) +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5))

# total number of missing values
cat("\nTotal number of missing values:", sum(is.na(super)))
## 
## Total number of missing values: 24
# percentage of missing data
cat("\nPercentage of missing data:", (sum(is.na(super))/nrow(super))*100, "%")
## 
## Percentage of missing data: 1.071429 %

There are some missing values in the Income variable. Since income might have an impact on the purchase capability of the customers and the missing data constitute only 1.07% of the entire dataset, we will include this variable in our analysis after removing the missing observations.

# remove missing data
super <- na.omit(super)
cat("Number of missing values:", sum(is.na(super)))
## Number of missing values: 0

The Id column represents the unique ID of each customer, so we can use this column to check for any incorrect duplicate entries. Since there are none, we can go ahead and remove the Id column from our dataset.

# duplicates in Id
sum(duplicated(super$Id))
## [1] 0
# remove the Id column
super <- super[-1]

Data Variation

Response Variable

# count and percentage of each category in Response
response_count <- as.data.frame(table(super$Response))
response_count$percent <- response_count$Freq / sum(response_count$Freq) * 100

# pie chart
ggplot(response_count, aes(x = '', y = Freq, fill = factor(Var1))) +
  geom_bar(stat = 'identity', width = 1) +
  coord_polar(theta = 'y') +
  theme_void() +
  scale_fill_manual(values = c('#3F7ED5', '#C7DFF9'),
                    name = '',
                    labels = c('refused', 'accepted')) +
  labs(title = 'Pie chart of Response') +
  theme(legend.text = element_text(size = 10),
        legend.position = 'right',
        plot.title = element_text(hjust = 0.5, face = 'bold')) +
  geom_text(aes(label = paste0(round(percent, 2), '%')),
            position = position_stack(vjust = 0.5))

Categorical Variables

# list of categorical column names
cat_cols <- c('Education', 'Marital_Status', 'Kidhome', 'Teenhome', 'Complain')

for (col in cat_cols) {
  # a data frame with the count of each category in the current column
  cat_count <- data.frame(table(super[[col]]))

  # a bar plot of the count data
  print(ggplot(cat_count, aes(x = Var1, y = Freq)) +
    geom_bar(stat = 'identity', fill = '#3F7ED5') +
    geom_text(aes(label = Freq), vjust = -0.5) +
    theme(axis.ticks = element_blank(),
          plot.title= element_text(face = "bold")) +
    labs(title = col, x = '', y = "Count"))
}

Numerical Variables

# names of numerical columns (with more than 20 unique values)
num_cols1 <- super %>%
  select_if(function(x) length(unique(x)) > 20) %>%
  colnames

# histogram for each column
for (col in num_cols1) {
  print(ggplot(super, aes(x = .data[[col]])) +
  geom_histogram(binwidth = 1, color = '#3F7ED5')+
  labs(title = col, x = '', y = 'Frequency') +
  theme(axis.ticks = element_blank(),
        plot.title= element_text(face = "bold")))
}

# names of the remaining numerical columns (with 20 or fewer unique values)
num_cols2 <- colnames(super)[!(colnames(super) %in% num_cols1 | colnames(super) %in% cat_cols | colnames(super) %in% "Response")]

# histogram for each numerical column
for (col in num_cols2) {
  print(ggplot(super, aes(x = .data[[col]])) +
  geom_histogram(binwidth = 1, color = '#3F7ED5')+
  labs(title = col, x = '', y = 'Frequency') +
  theme(axis.ticks = element_blank(),
        plot.title= element_text(face = "bold")))
}

Outliers

From the boxplots, some extreme outliers can be detected in MntMeatProducts, NumWebPurchases, NumCatalogPurchases, Income and Age. For example, Age has 3 entries of over 100 years, and Income has an entry of 666666. Since there is no information on the data collection process or the nature of the data, these outliers cannot be validated; they are therefore best kept in the dataset, as they may represent genuine variation in the data or contain important information.
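Boxplot whiskers are based on Tukey's fences at 1.5 × IQR beyond the quartiles. As a purely numeric cross-check of what the boxplots flag (a sketch with toy data, not part of the original analysis), the same rule can be counted directly:

```r
# count values outside Tukey's fences, Q1 - 1.5*IQR and Q3 + 1.5*IQR
# (the same rule boxplot whiskers are based on; the data below are toy values)
count_outliers <- function(x) {
  q <- quantile(x, c(0.25, 0.75))
  fence <- 1.5 * (q[2] - q[1])
  sum(x < q[1] - fence | x > q[2] + fence)
}

count_outliers(c(20, 22, 25, 27, 30, 666))  # the single extreme value is flagged
```

Applied to a column such as super$Age, this counts every point past the whisker fences of the corresponding boxplot.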

# types of product
boxplot(super[, c('MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds')],
              main = "The amount of money spent on types of products in the last 2 years",
              names = c("wines", "fruits", "meat", "fish", "sweet", "gold"),
              col = '#3F7ED5'
        )

# types of channels
boxplot(super[, c('NumDealsPurchases', 'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases')],
              main = 'Number of purchases made through types of channels',
              names = c("discount", "website", "catalog", "store"),
              col = '#3F7ED5'
        )

# Income
boxplot(super$Income, 
        main = "Income", 
        col = '#3F7ED5')

# Recency
boxplot(super$Recency, 
        main = "Recency", 
        col = '#3F7ED5')

# NumWebVisitsMonth
boxplot(super$NumWebVisitsMonth,
        main = "Number of visits to the company's website in the last month",
        col = '#3F7ED5')

# Age
boxplot(super$Age, 
        main = "Age", 
        col = '#3F7ED5')

# days_joined
boxplot(super$days_joined, 
        main = "Number of days since joining the program", 
        col = '#3F7ED5')

# outliers in Age
for (i in 1:length(super$Age)) {
  if (super$Age[i] > 100) {
    cat("Outlier:", super$Age[i], " at observation no.", i, "\n")
  }
}
## Outlier: 122  at observation no. 510 
## Outlier: 116  at observation no. 822 
## Outlier: 115  at observation no. 2210
# outlier in Income
cat("Outlier in Income:", max(super$Income), " at observation no.", which(super$Income == max(super$Income)))
## Outlier in Income: 666666  at observation no. 523

Correlation

From the correlation matrix, we can see that Recency has only very weak correlations with all other variables. We will therefore drop Recency from our clustering assessment.
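As its message notes, corrr::correlate uses Pearson's method by default. For reference, Pearson's r is simply the covariance rescaled by the two standard deviations; a quick check on toy vectors (not from the superstore data):

```r
# Pearson's r from its definition: r = cov(x, y) / (sd(x) * sd(y))
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 6)
r_manual <- cov(x, y) / (sd(x) * sd(y))
all.equal(r_manual, cor(x, y))  # identical to R's built-in cor()
```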

# correlation matrix
corr_matrix <- super %>%
  select_if(is.numeric) %>%
  correlate(diagonal = 1)
## Correlation computed with
## • Method: 'pearson'
## • Missing treated using: 'pairwise.complete.obs'
corr_matrix
## # A tibble: 19 × 20
##    term      Income Kidhome Teenhome  Recency MntWines MntFru…¹ MntMe…² MntFis…³
##    <chr>      <dbl>   <dbl>    <dbl>    <dbl>    <dbl>    <dbl>   <dbl>    <dbl>
##  1 Income   1       -0.429   0.0191  -3.97e-3  0.579    0.431    0.585   4.39e-1
##  2 Kidhome -0.429    1      -0.0399   1.15e-2 -0.497   -0.373   -0.439  -3.89e-1
##  3 Teenho…  0.0191  -0.0399  1        1.38e-2  0.00375 -0.177   -0.261  -2.05e-1
##  4 Recency -0.00397  0.0115  0.0138   1   e+0  0.0157  -0.00584  0.0225  5.51e-4
##  5 MntWin…  0.579   -0.497   0.00375  1.57e-2  1        0.387    0.569   3.98e-1
##  6 MntFru…  0.431   -0.373  -0.177   -5.84e-3  0.387    1        0.548   5.93e-1
##  7 MntMea…  0.585   -0.439  -0.261    2.25e-2  0.569    0.548    1       5.74e-1
##  8 MntFis…  0.439   -0.389  -0.205    5.51e-4  0.398    0.593    0.574   1   e+0
##  9 MntSwe…  0.441   -0.378  -0.163    2.51e-2  0.390    0.572    0.535   5.84e-1
## 10 MntGol…  0.326   -0.355  -0.0199   1.77e-2  0.393    0.396    0.359   4.27e-1
## 11 NumDea… -0.0831   0.217   0.386    2.12e-3  0.00889 -0.135   -0.121  -1.43e-1
## 12 NumWeb…  0.388   -0.372   0.162   -5.64e-3  0.554    0.302    0.307   3.00e-1
## 13 NumCat…  0.589   -0.505  -0.113    2.41e-2  0.635    0.486    0.734   5.33e-1
## 14 NumSto…  0.529   -0.501   0.0497  -4.34e-4  0.640    0.458    0.486   4.58e-1
## 15 NumWeb… -0.553    0.447   0.131   -1.86e-2 -0.322   -0.419   -0.539  -4.46e-1
## 16 Respon…  0.133   -0.0779 -0.154   -2.00e-1  0.246    0.122    0.238   1.08e-1
## 17 Compla… -0.0272   0.0410  0.00331  1.36e-2 -0.0395  -0.00532 -0.0238 -2.12e-2
## 18 Age      0.162   -0.234   0.351    1.63e-2  0.159    0.0177   0.0337  4.04e-2
## 19 days_j… -0.0167  -0.0569  0.00842  3.08e-2  0.149    0.0596   0.0713  7.80e-2
## # … with 11 more variables: MntSweetProducts <dbl>, MntGoldProds <dbl>,
## #   NumDealsPurchases <dbl>, NumWebPurchases <dbl>, NumCatalogPurchases <dbl>,
## #   NumStorePurchases <dbl>, NumWebVisitsMonth <dbl>, Response <dbl>,
## #   Complain <dbl>, Age <dbl>, days_joined <dbl>, and abbreviated variable
## #   names ¹​MntFruits, ²​MntMeatProducts, ³​MntFishProducts
# heatmap of correlation
corr_matrix %>%
  rearrange(method = "MDS", absolute = FALSE) %>%
  shave() %>%
  rplot(shape = 15, colours = c("#D3D3D3", "#3F7ED5")) +
  theme(axis.text.x = element_text(angle = 90, vjust = 1, hjust = 1),
  axis.text.y = element_text(angle = 0, vjust = 1, hjust = 1),
        axis.ticks = element_blank())

Subsets

In order to perform the clustering procedures, we will subset only the numerical variables. Recency and Response will also be excluded. Hence, the variables taken forward for clustering are:

  • Income, NumWebVisitsMonth, Age, days_joined

  • MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts and MntGoldProds

  • NumDealsPurchases, NumWebPurchases, NumCatalogPurchases and NumStorePurchases

# subset the dataset
df <- super[ , !names(super) %in% c(cat_cols, 'Response', 'Recency')]
cat("Subset df\nDimension:", dim(df), "\n\n")
## Subset df
## Dimension: 2216 14
colnames(df)
##  [1] "Income"              "MntWines"            "MntFruits"          
##  [4] "MntMeatProducts"     "MntFishProducts"     "MntSweetProducts"   
##  [7] "MntGoldProds"        "NumDealsPurchases"   "NumWebPurchases"    
## [10] "NumCatalogPurchases" "NumStorePurchases"   "NumWebVisitsMonth"  
## [13] "Age"                 "days_joined"
head(df)
##   Income MntWines MntFruits MntMeatProducts MntFishProducts MntSweetProducts
## 1  84835      189       104             379             111              189
## 2  57091      464         5              64               7                0
## 3  67267      134        11              59              15                2
## 4  32474       10         0               1               0                0
## 5  21474        6        16              24              11                0
## 6  71691      336       130             411             240               32
##   MntGoldProds NumDealsPurchases NumWebPurchases NumCatalogPurchases
## 1          218                 1               4                   4
## 2           37                 1               7                   3
## 3           30                 1               3                   2
## 4            0                 1               1                   0
## 5           34                 2               3                   1
## 6           43                 1               4                   7
##   NumStorePurchases NumWebVisitsMonth Age days_joined
## 1                 6                 1  45   198.70833
## 2                 7                 5  54   199.70833
## 3                 5                 2  57   232.70833
## 4                 2                 7  48    56.66667
## 5                 2                 7  26   149.70833
## 6                 5                 2  57   289.70833

Standardization

It’s generally recommended to standardize the variables in the data set before performing subsequent analysis. Standardization makes variables comparable when they are measured on different scales.
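scale() implements the usual z-score: each column has its mean subtracted and is then divided by its standard deviation. A minimal equivalence check on a toy vector:

```r
# z-score standardization by hand, compared against scale()
x <- c(10, 20, 30, 40)
z_manual <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), z_manual)  # TRUE: scale() does exactly this
```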

# scaling
df2 <- scale(df)
head(df2)
##       Income    MntWines  MntFruits MntMeatProducts MntFishProducts
## 1  1.2945477 -0.34415060  1.9511513       0.9452513       1.3399009
## 2  0.1924178  0.47107987 -0.5366661      -0.4592226      -0.5595702
## 3  0.5966592 -0.50719670 -0.3858893      -0.4815158      -0.4134571
## 4 -0.7854920 -0.87479153 -0.6623135      -0.7401173      -0.6874192
## 5 -1.2224668 -0.88664943 -0.2602420      -0.6375685      -0.4865136
## 6  0.7724026  0.09162714  2.6045175       1.0879280       3.6959757
##   MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
## 1        3.9435854   3.35874468        -0.6880206      -0.0311165
## 2       -0.6580846  -0.13442434        -0.6880206       1.0633941
## 3       -0.6093897  -0.26951927        -0.6880206      -0.3959534
## 4       -0.6580846  -0.84849756        -0.6880206      -1.1256271
## 5       -0.6580846  -0.19232217        -0.1681932      -0.3959534
## 6        0.1210341  -0.01862868        -0.6880206      -0.0311165
##   NumCatalogPurchases NumStorePurchases NumWebVisitsMonth         Age
## 1           0.4540800        0.06121821        -1.7807855 -0.09841872
## 2           0.1124021        0.36883623        -0.1315448  0.65248524
## 3          -0.2292757       -0.24639982        -1.3684753  0.90278656
## 4          -0.9126314       -1.16925390         0.6930755  0.15188260
## 5          -0.5709535       -1.16925390         0.6930755 -1.68366041
## 6           1.4791135       -0.24639982        -1.3684753  0.90278656
##   days_joined
## 1   -1.458227
## 2   -1.453926
## 3   -1.311971
## 4   -2.069243
## 5   -1.669009
## 6   -1.066776

Clustering Tendency Assessment

Before applying any clustering algorithm to the data set, the first thing to do is to assess the clustering tendency, that is, whether applying clustering is suitable for the data at all.

set.seed(123)
# random data generated from the "df" data set
random_df <- apply(df, 2,
                function(x){runif(length(x), min(x), (max(x)))})
random_df <- as.data.frame(random_df)

# standardize the random data sets
random_df <- scale(random_df)

Visual Inspection

We start by visualizing the data to assess whether they contain any meaningful clusters. As the data contain more than two variables, we reduce the dimensionality with principal component analysis (PCA) in order to create a scatter plot.

As seen in the plots, the superstore data set may contain 2 real clusters, while the randomly generated data set does not contain any meaningful clusters.

# plot the superstore data set
fviz_pca_ind(prcomp(df2), title = "PCA - Superstore data", palette = "jco",
             geom = "point", ggtheme = theme_classic())

# plot random data set
fviz_pca_ind(prcomp(random_df), title = "PCA - Random data",
             geom = "point", ggtheme = theme_classic())

Hopkins Statistic Method

We can conduct the Hopkins statistic test, using 0.5 as the threshold: if H ≤ 0.5, we cannot reject the null hypothesis that the data are uniformly distributed, and it is unlikely that the data set has statistically significant clusters.

We can see that the superstore data set is highly clusterable (H = 0.89, far above the 0.5 threshold), whereas the random_df data set is not clusterable (H ≈ 0.5).

# compute Hopkins statistic for the superstore dataset
set.seed(123)
res <- get_clust_tendency(df2, n = nrow(df2)-1, graph = FALSE)
res$hopkins_stat
## [1] 0.8915371
# compute Hopkins statistic for a random dataset
set.seed(123)
res.re <- get_clust_tendency(random_df, n = nrow(random_df)-1,
                          graph = FALSE)
res.re$hopkins_stat
## [1] 0.5009627
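For reference, the statistic reported by get_clust_tendency can be reproduced from its definition: compare the nearest-neighbour distances of m real points (w) against those of m uniform points drawn in the data's bounding box (u), with H = Σu / (Σu + Σw). A simplified sketch follows; the sample size m, the toy data, and the exact sampling details (which differ from factoextra's implementation) are assumptions:

```r
# Hopkins statistic from its definition (illustrative sketch only;
# the analysis above uses factoextra::get_clust_tendency)
hopkins_manual <- function(X, m = 50) {
  X <- as.matrix(X)
  nn_dist <- function(p, exclude = NA) {
    d <- sqrt(colSums((t(X) - p)^2))        # distances from p to every row of X
    if (!is.na(exclude)) d[exclude] <- Inf  # a real point must not match itself
    min(d)
  }
  idx <- sample(nrow(X), m)
  # w: nearest-neighbour distances from m real points to the remaining data
  w <- sapply(idx, function(i) nn_dist(X[i, ], exclude = i))
  # u: nearest-neighbour distances from m uniform points to the data
  U <- sapply(seq_len(ncol(X)), function(j) runif(m, min(X[, j]), max(X[, j])))
  u <- apply(U, 1, nn_dist)
  sum(u) / (sum(u) + sum(w))  # near 1: clusterable; near 0.5: random
}

set.seed(123)
# two tight, well-separated blobs should give H well above 0.5
blobs <- rbind(matrix(rnorm(200, mean = 0, sd = 0.1), ncol = 2),
               matrix(rnorm(200, mean = 5, sd = 0.1), ncol = 2))
hopkins_manual(blobs)
```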

Visual Method

For a visual evaluation of cluster tendency, we compute the dissimilarity matrix (DM) between the objects in the data set using the Euclidean distance measure. The dissimilarity matrix images confirm that there is a cluster structure in the original data set but not in the random one.

# superstore data
fviz_dist(dist(df2), show_labels = FALSE) + labs(title = "Superstore data")

# random data
fviz_dist(dist(random_df), show_labels = FALSE)+ labs(title = "Random data")

Cluster Algorithm Selection

Before selecting a clustering algorithm, it should be noted that our dataset contains many outliers. Some clustering algorithms, such as K-means, are sensitive to outliers and may not be a good method to apply here. A more robust, less noise-sensitive alternative to K-means is K-medoids clustering, of which we will consider two types: Partitioning Around Medoids (PAM) and Clustering Large Applications (CLARA). CLARA is a variation of PAM designed to handle large datasets. In our case, we have over 2,000 observations, so CLARA may be more suitable than PAM. However, because CLARA relies on random sampling, its clustering results may be less accurate than PAM's.

Internal Measures

The algorithms evaluated are: hierarchical, K-means, PAM and CLARA. The 3 internal measures considered when determining the best clustering algorithm are:

  • Connectivity – between 0 and infinity – should be minimized
  • Silhouette Coefficient - between -1 and 1 – should be maximized
  • Dunn index – between 0 and infinity – should be maximized

Among the algorithms assessed, hierarchical clustering yields the best results for all 3 internal measures. Regardless of the clustering algorithm, the suggested optimal number of clusters is 2.

# define algorithms to consider
clmethods <- c("hierarchical","kmeans","pam", "clara")

# apply internal measures
intern <- clValid(df2, nClust = 2:6,
              clMethods = clmethods, maxitems = 2500, validation = "internal")
# summary
summary(intern)
## 
## Clustering Methods:
##  hierarchical kmeans pam clara 
## 
## Cluster sizes:
##  2 3 4 5 6 
## 
## Validation Measures:
##                                    2         3         4         5         6
##                                                                             
## hierarchical Connectivity     2.9290    7.6448   14.4317   18.7187   18.7187
##              Dunn             1.0428    0.5017    0.4574    0.5265    0.5265
##              Silhouette       0.8031    0.6418    0.5512    0.5036    0.4938
## kmeans       Connectivity     2.9290    7.6448   14.4317   26.3313   26.3313
##              Dunn             1.0428    0.5017    0.4574    0.1656    0.1656
##              Silhouette       0.8031    0.6418    0.5512    0.3954    0.3924
## pam          Connectivity   270.7679  540.5361  846.5698 1007.2290 1173.6619
##              Dunn             0.0293    0.0268    0.0246    0.0152    0.0154
##              Silhouette       0.3223    0.2391    0.2068    0.1235    0.1103
## clara        Connectivity   313.0734  500.9040  512.1988  710.8210 1067.2381
##              Dunn             0.0424    0.0208    0.0256    0.0152    0.0134
##              Silhouette       0.3411    0.2480    0.2449    0.1367    0.1193
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 2.9290 hierarchical 2       
## Dunn         1.0428 hierarchical 2       
## Silhouette   0.8031 hierarchical 2
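The Dunn index in the table above is the smallest between-cluster distance divided by the largest within-cluster diameter. A from-scratch sketch on toy data (clValid computes this internally; the function and example below are illustrations, not part of the analysis):

```r
# Dunn index from its definition (toy illustration)
dunn_manual <- function(X, labels) {
  D  <- as.matrix(dist(X))
  ks <- unique(labels)
  # largest within-cluster distance (cluster diameter)
  diam <- max(sapply(ks, function(k) {
    i <- which(labels == k)
    if (length(i) < 2) 0 else max(D[i, i])
  }))
  # smallest between-cluster distance over all cluster pairs
  sep <- min(apply(combn(ks, 2), 2, function(p) {
    min(D[labels == p[1], labels == p[2], drop = FALSE])
  }))
  sep / diam
}

# two tight groups far apart give a large Dunn index (about 99 here)
dunn_manual(c(0, 0.1, 10, 10.1), c(1, 1, 2, 2))
```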

Stability Measures

4 values are considered for the measurement of stability:

  • The average proportion of non-overlap (APN)
  • The average distance (AD)
  • The average distance between means (ADM)
  • The figure of merit (FOM)

APN ranges from 0 to 1, while AD, ADM and FOM take non-negative values; for all four measures, smaller values correspond to more consistent clustering results.

The best clustering methods and number of clusters for each stability measure are summarized as follows:

  • APN and ADM: hierarchical - 3 clusters
  • AD: PAM - 6 clusters
  • FOM: CLARA - 6 clusters

# apply stability measures
stab <- clValid(df2, nClust = 2:6, clMethods = clmethods, maxitems = 2500, validation = "stability")

# display only optimal scores
optimalScores(stab)
##            Score       Method Clusters
## APN 0.0001931362 hierarchical        3
## AD  3.5938465451          pam        6
## ADM 0.0028772905 hierarchical        3
## FOM 0.7938961611        clara        6

Hierarchical Clustering

Linkage Method

There are many linkage methods for hierarchical clustering. To check whether the clustering from a particular linkage is valid, we compute the correlation between the cophenetic distances and the original distances in the distance matrix. The closer the correlation coefficient is to 1, the more accurately the clustering solution reflects the data. Values above 0.75 indicate that the clustering solution is good to move forward with.

We will consider 3 linkage methods: Ward, complete and average. Comparing the correlation coefficients between the original distances and the cophenetic distances under each of these 3 linkages, the average linkage method is the best choice, as it has the largest correlation coefficient, 0.85 (> 0.75).

# compute the dissimilarity matrix
res.dist <- dist(df2, method = "euclidean")

# view the first 6*6 matrix
as.matrix(res.dist)[1:6, 1:6]
##          1        2        3        4        5        6
## 1 0.000000 7.191652 6.846000 8.270489 7.924138 5.911885
## 2 7.191652 0.000000 2.328817 3.592500 3.921566 6.054592
## 3 6.846000 2.328817 0.000000 3.136840 3.976533 5.691299
## 4 8.270489 3.592500 3.136840 0.000000 2.298475 7.174513
## 5 7.924138 3.921566 3.976533 2.298475 0.000000 7.142218
## 6 5.911885 6.054592 5.691299 7.174513 7.142218 0.000000
### Ward method
res.hc.ward <- hclust(d = res.dist, method = "ward.D2")

# compute cophenetic distance
res.coph.ward <- cophenetic(res.hc.ward)

# correlation between cophenetic distance and the original distance
cor(res.dist, res.coph.ward)
## [1] 0.5958253
### complete method
res.hc.com <- hclust(d = res.dist, method = "complete")

# compute cophenetic distance
res.coph.com <- cophenetic(res.hc.com)

# correlation between cophenetic distance and the original distance
cor(res.dist, res.coph.com)
## [1] 0.7671059
### average method
res.hc.avg <- hclust(d = res.dist, method = "average")

# compute cophenetic distance
res.coph.avg <- cophenetic(res.hc.avg)

# correlation between cophenetic distance and the original distance
cor(res.dist, res.coph.avg)
## [1] 0.8496363

Optimal Number of Clusters

To determine the optimal number of clusters, we will use 3 approaches: the elbow, silhouette and gap statistic methods. The elbow and gap statistic methods suggest 3 clusters, while the silhouette method suggests 2. Although 3 clusters agrees with the stability measures from the cluster algorithm selection procedure, we will ultimately consider both 2 and 3 clusters to see which number produces the better clustering result.

# elbow plot
# 3 clusters suggested
fviz_nbclust(df2, hcut, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)+
  labs(subtitle = "Elbow method")

# silhouette method
# 2 clusters suggested
fviz_nbclust(df2, hcut, method = "silhouette") +
  labs(subtitle = "Silhouette method") +
  theme_classic()

# gap statistic method
# 3 clusters suggested
fviz_nbclust(df2, hcut,  method = "gap_stat",
                  nboot = 50, maxSE = list(method = 'Tibs2001SEmax',
                                            SE.factor = 1)) +
  labs(subtitle = "Gap statistic method")

Agglomerative Hierarchical Clustering

# compute agglomerative hierarchical clustering
res.agnes <- agnes(x = df2, # data matrix
                  metric = "euclidean", # metric for distance matrix
                  method = "average" # Linkage method
                   )

# dendrogram
fviz_dend(res.agnes, show_labels = FALSE,
          palette = "jco", as.ggplot = TRUE)
## Registered S3 method overwritten by 'dendextend':
##   method       from   
##   text.pvclust pvclust
## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the factoextra package.
##   Please report the issue at <https://github.com/kassambara/factoextra/issues>.

As shown in the dendrogram, there are relatively large cophenetic distances between some objects. If we cut the tree near the top, we obtain clusters with very few data points. This may be because outliers in the data set inflate the distances between clusters.

# cut the tree into 2 groups
grp <- cutree(res.agnes, k = 2)
fviz_cluster(list(data = df2, cluster = grp),
             ellipse.type = "convex", geom = "point",
             repel = TRUE, # avoid label overplotting (slow)
             show.clust.cent = FALSE, ggtheme = theme_minimal())

# cut the tree into 3 groups
grp1 <- cutree(res.agnes, k = 3)
fviz_cluster(list(data = df2, cluster = grp1),
             ellipse.type = "convex", geom = "point",
             repel = TRUE, # avoid label overplotting (slow)
             show.clust.cent = FALSE, ggtheme = theme_minimal())

From the cluster plot, it is seen that the clusters overlap. Cluster 1 contains the majority of the data points and covers a wide range in the plot, while the other clusters contain very few data points located within the range of cluster 1. Again, this may result from the outliers in the data set.

Cluster Validation

Silhouette Coefficient

###plot the silhouette for each observation
sil <- silhouette(grp, dist(df2))
#sil <- silhouette(cutree(res.hc.avg, k = 2), dist(df2))
fviz_silhouette(sil)
##   cluster size ave.sil.width
## 1       1 2215           0.8
## 2       2    1           0.0

P-value

# transpose the data set to perform the clustering on the variables
df.t <- t(df2)
set.seed(123)
ss <- sample(1:2034, 30) # extract 30 samples out of 2034
df3 <- df.t[, ss]
### Create a parallel socket cluster
set.seed(123)

cl <- makeCluster(2, type = "PSOCK")

### parallel version of pvclust
res.pv <- parPvclust(cl, df3, nboot=1000)
## Warning in parPvclust(cl, df3, nboot = 1000): "parPvclust" has been integrated into pvclust (with "parallel" option).
## It is available for back compatibility but will be unavailable in the future.
## Multiscale bootstrap... Done.
stopCluster(cl)
# Default plot
plot(res.pv, hang = -1, cex = 0.5)
pvrect(res.pv)

Values on the dendrogram are AU p-values (red, left), BP values (green, right), and cluster labels (grey, bottom). Clusters with AU >= 95% are indicated by the rectangles and are considered to be strongly supported by the data.

We can see from the plot that once we compute hierarchical clustering on the bootstrap samples, some meaningful and strong clusters emerge, likely because the sampling process reduces the impact of the outliers.

Overall, given the outliers in our data set, hierarchical clustering is not an appropriate method.

CLARA

Optimal Number of Clusters

Similar to the procedure in hierarchical clustering, we will use the elbow, silhouette and gap statistic methods to determine the optimal number of clusters for CLARA. While the elbow and gap statistic methods suggest 3 as the optimal number of clusters, the silhouette method suggests 2.

Recall that in our cluster algorithm selection, the FOM value suggested 6 as the optimal number of clusters for the CLARA algorithm. FOM has many limitations, among which are its lack of robustness, its susceptibility to noise, and its assumption that the data are uniformly distributed. Since our dataset contains some extreme outliers and is not uniformly distributed, the FOM value does not provide a good choice for the optimal number of clusters.

In short, moving forward, we will be considering 2 clusters and 3 clusters for the CLARA algorithm.

# elbow plot
# 3 clusters suggested
fviz_nbclust(df2, clara, method = "wss") +
  geom_vline(xintercept = 3, linetype = 2)+
  labs(subtitle = "Elbow method")

# silhouette method
# 2 clusters suggested
fviz_nbclust(df2, clara, method = "silhouette")+
  labs(subtitle = "Silhouette method") + theme_classic()

# gap statistic method
# 3 clusters suggested
fviz_nbclust(df2, clara,  method = "gap_stat", nboot = 50)+
  labs(subtitle = "Gap statistic method")

CLARA Clustering

## CLARA clustering

# save the numbers of clusters to test in a list
k_list <- list(2, 3)

# empty list for clara objects
clara_list <- list()

# empty list for datasets with clustering labels
dfclusters_list <- list()

#empty list for medoids objects (dataframes)
medoids_list <- list()

# loop through list of numbers of clusters
for (i in seq_along(k_list)) {
  n_cluster <- k_list[[i]]

  # compute clara cluster for a specified number of clusters
  # save clara object to empty list
  clara.res <- clara(df2, n_cluster, samples = 200, pamLike = TRUE)
  clara_list[[i]] <- clara.res

  # save the medoids to the empty list
  medoids_list[[i]] <- clara.res$medoids

  # add clustering labels to dataset
  dfclusters <- cbind(df2, cluster = clara.res$cluster)
  dfclusters_list[[i]] <- dfclusters
}
# example: print results for CLARA with 2 clusters
# can do the same for CLARA with 3 clusters by changing the index from 1 to 2
clara_list[[1]]
## Call:     clara(x = df2, k = n_cluster, samples = 200, pamLike = TRUE) 
## Medoids:
##         Income   MntWines   MntFruits MntMeatProducts MntFishProducts
## 120  0.7277119  0.2250285  0.04131167       0.2140332       0.8467690
## 271 -0.6596830 -0.7176744 -0.58692506      -0.4458466      -0.5413061
##     MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
## 120       0.02364428   -0.4432128        -0.1681932       0.6985572
## 271      -0.43895746   -0.3081178        -0.1681932      -0.3959534
##     NumCatalogPurchases NumStorePurchases NumWebVisitsMonth         Age
## 120           0.7957578         0.9840723        -0.9561652  0.06844883
## 271          -0.5709535        -0.5540178         0.6930755 -0.26528627
##      days_joined
## 120  0.137690499
## 271 -0.008566125
## Objective function:   2.943866
## Clustering vector:    Named int [1:2216] 1 1 2 2 2 1 1 2 1 1 1 2 1 2 1 2 2 2 ...
##  - attr(*, "names")= chr [1:2216] "1" "2" "3" "4" "5" "6" "7" ...
## Cluster sizes:            969 1247 
## Best sample:
##  [1] 120  133  143  167  197  229  234  271  310  390  495  560  572  661  663 
## [16] 680  701  724  747  751  773  783  839  842  936  943  956  960  986  1029
## [31] 1164 1230 1349 1447 1488 1563 1598 1723 1767 1783 1840 1972 2102 2228
## 
## Available components:
##  [1] "sample"     "medoids"    "i.med"      "clustering" "objective" 
##  [6] "clusinfo"   "diss"       "call"       "silinfo"    "data"
# CLARA medoids for 2 clusters
medoids_list[[1]]
##         Income   MntWines   MntFruits MntMeatProducts MntFishProducts
## 120  0.7277119  0.2250285  0.04131167       0.2140332       0.8467690
## 271 -0.6596830 -0.7176744 -0.58692506      -0.4458466      -0.5413061
##     MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
## 120       0.02364428   -0.4432128        -0.1681932       0.6985572
## 271      -0.43895746   -0.3081178        -0.1681932      -0.3959534
##     NumCatalogPurchases NumStorePurchases NumWebVisitsMonth         Age
## 120           0.7957578         0.9840723        -0.9561652  0.06844883
## 271          -0.5709535        -0.5540178         0.6930755 -0.26528627
##      days_joined
## 120  0.137690499
## 271 -0.008566125
# Dataset with clustering labels for 2 clusters
head(dfclusters_list[[1]],3)
##      Income   MntWines  MntFruits MntMeatProducts MntFishProducts
## 1 1.2945477 -0.3441506  1.9511513       0.9452513       1.3399009
## 2 0.1924178  0.4710799 -0.5366661      -0.4592226      -0.5595702
## 3 0.5966592 -0.5071967 -0.3858893      -0.4815158      -0.4134571
##   MntSweetProducts MntGoldProds NumDealsPurchases NumWebPurchases
## 1        3.9435854    3.3587447        -0.6880206      -0.0311165
## 2       -0.6580846   -0.1344243        -0.6880206       1.0633941
## 3       -0.6093897   -0.2695193        -0.6880206      -0.3959534
##   NumCatalogPurchases NumStorePurchases NumWebVisitsMonth         Age
## 1           0.4540800        0.06121821        -1.7807855 -0.09841872
## 2           0.1124021        0.36883623        -0.1315448  0.65248524
## 3          -0.2292757       -0.24639982        -1.3684753  0.90278656
##   days_joined cluster
## 1   -1.458227       1
## 2   -1.453926       1
## 3   -1.311971       2

Visualization

for (i in seq_along(k_list)) {
  p <- fviz_cluster(clara_list[[i]],
             ellipse.type = "t", # Concentration ellipse
             geom = "point", pointsize = 1,
             ggtheme = theme_classic())
  print(p)
    }

Cluster Validation

Internal Validation

Silhouette (Si) coefficient: the average silhouette width measures how similar an object i is to the other objects in its own cluster versus those in the neighboring cluster. Si values range from -1 to 1:

  • A value of Si close to 1 indicates that the object is well clustered. In other words, the object i is similar to the other objects in its group.
  • A value of Si close to -1 indicates that the object is poorly clustered, and that assignment to some other cluster would probably improve the overall results.
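The definition above reduces to a one-line formula, s(i) = (b - a) / max(a, b), where a is the mean distance from object i to the other members of its own cluster and b is the mean distance to the members of the nearest other cluster. A minimal numeric sketch (Python, illustrative only; the report computes Si with cluster::silhouette, and the toy points here are made up):

```python
# Minimal sketch of the silhouette width s(i) = (b - a) / max(a, b).
# a = mean distance to the other members of i's own cluster (cohesion);
# b = mean distance to the members of the nearest other cluster (separation).

def silhouette_width(point, own_others, nearest_other):
    a = sum(abs(point - x) for x in own_others) / len(own_others)
    b = sum(abs(point - x) for x in nearest_other) / len(nearest_other)
    return (b - a) / max(a, b)

# i = 1.0 sits near its cluster-mate and far from the other cluster
print(round(silhouette_width(1.0, [2.0], [10.0, 11.0]), 3))   # → 0.895
# i = 5.5 sits between the two clusters: weak assignment, s near 0
print(round(silhouette_width(5.5, [1.0, 2.0], [10.0, 11.0]), 3))  # → 0.2
```

Average Si values around 0.25-0.29, as in the CLARA results below, correspond to the second situation: many observations lie between clusters rather than clearly inside one.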

The CLARA procedures with 2 and 3 clusters have similar overall average Si coefficients (0.29 and 0.25). In both cases, the largest cluster attains the highest average Si (around 0.34 - 0.39), while the remaining clusters fall between 0.11 and 0.18. In other words, the 2-cluster and 3-cluster procedures produce similarly weak clusterings: average Si coefficients this close to 0 indicate little genuine cluster structure.

# plot the silhouette for each observation
for (i in seq_along(k_list)) {
  p <- fviz_silhouette(clara_list[[i]], palette = "jco",
                ggtheme = theme_classic())

  print(p)
    }
##   cluster size ave.sil.width
## 1       1   17          0.13
## 2       2   27          0.39

##   cluster size ave.sil.width
## 1       1   10          0.18
## 2       2   12          0.11
## 3       3   24          0.34

# silhouette information
silinfo_2clust <- clara_list[[1]]$silinfo
silinfo_3clust <- clara_list[[2]]$silinfo
# silhouette widths of each observation
sil_2clust <- silinfo_2clust$widths[, 1:3]
sil_3clust <- silinfo_3clust$widths[, 1:3]
# average silhouette width of each cluster
silinfo_2clust$clus.avg.widths # 2 clusters
## [1] 0.1337040 0.3875908
silinfo_3clust$clus.avg.widths # 3 clusters
## [1] 0.1752261 0.1144386 0.3427295
# the total average (mean of all individual silhouette widths)
silinfo_2clust$avg.width # 2 clusters
## [1] 0.2894982
silinfo_3clust$avg.width # 3 clusters
## [1] 0.2467616
# the size of each clusters
clara_list[[1]]$clusinfo # 2 clusters
##      size  max_diss  av_diss isolation
## [1,]  969 23.982855 3.944294  6.389860
## [2,] 1247  9.013688 2.166469  2.401557
clara_list[[2]]$clusinfo # 3 clusters
##      size  max_diss  av_diss isolation
## [1,]  546 24.172099 4.094751  5.042084
## [2,]  640  9.821765 3.080844  3.446752
## [3,] 1030  8.982295 1.956304  3.152156
### objects with negative silhouette
# 2 clusters
neg_sil_index_2clust <- which(sil_2clust[, 'sil_width'] < 0)
sil_2clust[neg_sil_index_2clust, , drop = FALSE]
##      cluster neighbor    sil_width
## 724        1        2 -0.003533189
## 1488       1        2 -0.045764156
## 751        1        2 -0.057225226
# 3 clusters
neg_sil_index_3clust <- which(sil_3clust[, 'sil_width'] < 0)
sil_3clust[neg_sil_index_3clust, , drop = FALSE]
##      cluster neighbor   sil_width
## 1439       1        2 -0.13986700
## 954        2        1 -0.04452003
## 199        2        3 -0.09428869

Dunn Index: reflects the compactness and separation of clusters, calculated as the ratio of the minimum inter-cluster distance to the maximum intra-cluster distance. Its range is from 0 to infinity. A higher Dunn index value indicates better clustering, where clusters are well separated and compact. Conversely, a lower Dunn index value indicates poor clustering, where clusters are not well separated or are too spread out.
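The ratio in that definition can be computed by hand on a toy example. The sketch below (Python, illustrative only; the report obtains the Dunn index from fpc::cluster.stats on the full distance matrix, and the toy clusters here are made up) contrasts a well-separated configuration with a nearly touching one:

```python
# Minimal Dunn index sketch: minimum inter-cluster distance divided by the
# maximum intra-cluster diameter. Larger is better (compact, well separated).

def dunn(clusters):
    # diameter: largest pairwise distance within any single cluster
    max_diam = max(abs(x - y) for c in clusters for x in c for y in c)
    # separation: smallest distance between points of different clusters
    min_sep = min(abs(x - y)
                  for i, ci in enumerate(clusters)
                  for cj in clusters[i + 1:]
                  for x in ci for y in cj)
    return min_sep / max_diam

compact = [[1.0, 2.0], [10.0, 11.0]]   # tight and far apart -> large index
overlap = [[1.0, 6.0], [7.0, 12.0]]    # spread out, nearly touching -> small
print(dunn(compact), dunn(overlap))    # → 8.0 0.2
```

Values near 0.03, as reported for CLARA below, are an extreme version of the second case: some pair of clusters nearly touches relative to the spread inside the clusters.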

From the results below, we can see that the CLARA algorithm with 2 clusters and with 3 clusters produces approximately the same Dunn index values, both very close to 0, indicating poor clustering results in both cases.

for (i in seq_along(k_list)) {

  # statistics for CLARA clustering
  clara_stats <- cluster.stats(d = dist(df2),
                              clara_list[[i]]$cluster)

  # Dunn's index
  cat(k_list[[i]], "clusters - Dunn Index:", clara_stats$dunn, "\n")
}
## 2 clusters - Dunn Index: 0.02927206 
## 3 clusters - Dunn Index: 0.02609056

External Validation

In external validation, the result of CLARA clustering is compared with the true labels of the dataset (i.e. Response - 0/1) to assess whether the clustering matches the data structure. The corrected Rand index and Meila's variation of information (VI) are used to quantify the agreement between the CLARA clustering and the true labels. The corrected Rand index ranges from about -1 to 1, where 1 indicates perfect agreement and values near 0 indicate agreement no better than chance. Meila's VI is a distance between partitions: 0 indicates identical partitions, and larger values indicate greater disagreement.
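Both indices can be computed from the cross-tabulation of the two labelings. The sketch below (Python, illustrative formulas only; the report obtains both values from fpc::cluster.stats, and the toy label vectors are made up) shows the two extremes, a relabeled but identical partition and a weakly agreeing one:

```python
import math
from itertools import product

def comb2(n):
    return n * (n - 1) // 2

def corrected_rand(a, b):
    """Hubert-Arabie corrected (adjusted) Rand index between two labelings."""
    n = len(a)
    nij = {}                              # contingency table cell counts
    for pair in zip(a, b):
        nij[pair] = nij.get(pair, 0) + 1
    sum_ij = sum(comb2(v) for v in nij.values())
    sum_a = sum(comb2(a.count(x)) for x in set(a))
    sum_b = sum(comb2(b.count(x)) for x in set(b))
    expected = sum_a * sum_b / comb2(n)   # chance-level agreement
    return (sum_ij - expected) / ((sum_a + sum_b) / 2 - expected)

def meila_vi(a, b):
    """Meila's variation of information: 0 iff the partitions coincide."""
    n = len(a)
    vi = 0.0
    for x, y in product(set(a), set(b)):
        p_xy = sum(1 for u, v in zip(a, b) if u == x and v == y) / n
        if p_xy == 0:
            continue
        vi -= p_xy * (math.log(p_xy / (a.count(x) / n)) +
                      math.log(p_xy / (b.count(y) / n)))
    return vi

truth = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]   # same partition, different label names
shuffled = [0, 1, 0, 1, 0, 1]    # weak agreement with truth
print(corrected_rand(truth, relabeled), meila_vi(truth, relabeled))  # → 1.0 0.0
print(round(corrected_rand(truth, shuffled), 3))                     # → -0.111
```

Note that relabeling leaves both indices unaffected, which is why the cluster numbers (1/2) need not match the response codes (0/1) for agreement to register.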

From the cross-tabulation between the Response labels and the CLARA clusters, a few points can be noted about the distribution profile:

  • When considering 2 clusters, 40.6% of response 0 (refused the offer) is classified into cluster 1 and 59.4% into cluster 2, while 61.6% of response 1 (accepted the offer) is classified into cluster 1 and 38.4% into cluster 2. There is approximately a 40/60 split between the 2 clusters in both classes of response, but the ratio is reversed between the two classes.

  • When considering 3 clusters, most of response 0 (refused the offer) is classified into cluster 3 (49.5%), while clusters 1 and 2 hold smaller shares (21.2% and 29.3%). Conversely, most of response 1 (accepted the offer) is classified into cluster 1 (44.1%), while clusters 2 and 3 share about the same proportion (26.7% and 29.1%).

The corrected Rand indices indicate that the agreement between the response labels and the CLARA clustering is 0.03 for 2 clusters and 0.04 for 3 clusters. Both values are very close to 0, so the agreement is no better than chance. Similarly, Meila's VI is 1.09 for 2 clusters and 1.45 for 3 clusters; both values are far from 0, confirming a substantial disagreement between the clustering and the response labels.

### cross-tabulation between Response labels and CLARA with 2 clusters
tab_2clust <- table(super$Response, clara_list[[1]]$cluster)
dimnames(tab_2clust) <- list(Response = c(0, 1), Cluster = c(1, 2))

# count
tab_2clust
##         Cluster
## Response    1    2
##        0  764 1119
##        1  205  128
# percentage
prop.table(tab_2clust, margin = 1)*100
##         Cluster
## Response        1        2
##        0 40.57355 59.42645
##        1 61.56156 38.43844
### cross-tabulation between Response labels and CLARA with 3 clusters
tab_3clust <- table(super$Response, clara_list[[2]]$cluster)
dimnames(tab_3clust) <- list(Response = c(0, 1), Cluster = c(1, 2, 3))

# count
tab_3clust
##         Cluster
## Response   1   2   3
##        0 399 551 933
##        1 147  89  97
# percentage
prop.table(tab_3clust, margin = 1)*100
##         Cluster
## Response        1        2        3
##        0 21.18959 29.26182 49.54859
##        1 44.14414 26.72673 29.12913
# compute cluster stats
# CLARA with 2 clusters
stats_2clust <- cluster.stats(d = dist(df2),
                              super$Response,
                              clara_list[[1]]$cluster)
## Warning in cluster.stats(d = dist(df2), super$Response, clara_list[[1]]
## $cluster): clustering renumbered because maximum != number of clusters
# CLARA with 3 clusters
stats_3clust <- cluster.stats(d = dist(df2),
                              super$Response,
                              clara_list[[2]]$cluster)
## Warning in cluster.stats(d = dist(df2), super$Response, clara_list[[2]]
## $cluster): clustering renumbered because maximum != number of clusters
# corrected Rand Index
cat("\n2 clusters - corrected Rand Index:", stats_2clust$corrected.rand)
## 
## 2 clusters - corrected Rand Index: 0.03031754
cat("\n3 clusters - corrected Rand Index:", stats_3clust$corrected.rand)
## 
## 3 clusters - corrected Rand Index: 0.03936037
# Meila's variation index VI
cat("\n2 clusters - Meila's variation index VI:", stats_2clust$vi)
## 
## 2 clusters - Meila's variation index VI: 1.085724
cat("\n3 clusters - Meila's variation index VI:", stats_3clust$vi)
## 
## 3 clusters - Meila's variation index VI: 1.446868

Based on the results of both internal and external validation, it can be inferred that the clustering quality is comparable for the 2-cluster and 3-cluster CLARA procedures. However, like hierarchical clustering, CLARA is not a suitable clustering algorithm for this dataset.

DBSCAN

# Determine the value of epsilon
dbscan::kNNdistplot(df2, k =  4)
abline(h = 3.5, lty = 2)

# DBSCAN
set.seed(123)
db <- fpc::dbscan(df2, eps = 3.5, MinPts = 4)
# plot DBSCAN results
fviz_cluster(db, data = df2, stand = FALSE,
             ellipse = FALSE, show.clust.cent = FALSE,
             geom = "point",palette = "jco", ggtheme = theme_classic())

# internal validation - Si coefficient
# plot the silhouette for each observation
sil_db <- silhouette(db$cluster, dist(df2))
fviz_silhouette(sil_db)
##   cluster size ave.sil.width
## 0       0   25         -0.23
## 1       1 2187          0.29
## 2       2    4          0.70

Hierarchical K-means Clustering

# hierarchical k-means clustering
res.hk <- hkmeans(df2, 2) # 2 clusters
res.hk1 <- hkmeans(df2, 3) # 3 clusters
# visualize the tree
fviz_dend(res.hk, cex = 0.6, palette = "jco",
          rect = TRUE, rect_border = "jco", rect_fill = TRUE)

# visualize the final clusters

# 2 clusters
fviz_cluster(res.hk,
            ellipse.type = "t",
            geom = "point", pointsize = 1,
            palette = "jco", repel = TRUE,
            ggtheme = theme_classic())

# 3 clusters
fviz_cluster(res.hk1,
            ellipse.type = "t",
            geom = "point", pointsize = 1,
            palette = "jco", repel = TRUE,
            ggtheme = theme_classic())

# internal validation - Si coefficient
# plot the silhouette for each observation

# 2 clusters
sil_hk <- silhouette(res.hk$cluster, dist(df2))
fviz_silhouette(sil_hk)
##   cluster size ave.sil.width
## 1       1  877          0.13
## 2       2 1339          0.47

# 3 clusters
sil_hk1 <- silhouette(res.hk1$cluster, dist(df2))
fviz_silhouette(sil_hk1)
##   cluster size ave.sil.width
## 1       1  620          0.10
## 2       2  588          0.09
## 3       3 1008          0.43

Fuzzy Clustering

# fuzzy clustering
res.fanny <- fanny(df2, 2, memb.exp = 1.2) # 2 clusters
res.fanny1 <- fanny(df2, 3, memb.exp = 1.2) # 3 clusters
# visualize the final clusters

# 2 clusters
fviz_cluster(res.fanny, ellipse.type = "t", repel = TRUE,
             palette = "jco", ggtheme = theme_minimal(),
             legend = "right",
             labelsize = 0)
## Warning: ggrepel: 1756 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# 3 clusters
fviz_cluster(res.fanny1, ellipse.type = "t", repel = TRUE,
             palette = "jco", ggtheme = theme_minimal(),
             legend = "right",
             labelsize = 0)
## Warning: ggrepel: 1757 unlabeled data points (too many overlaps). Consider
## increasing max.overlaps

# internal validation - Si coefficient
# plot the silhouette for each observation

# 2 clusters
sil_fanny <- silhouette(res.fanny$cluster, dist(df2))
fviz_silhouette(sil_fanny)
##   cluster size ave.sil.width
## 1       1  997          0.10
## 2       2 1219          0.51

# 3 clusters
sil_fanny1 <- silhouette(res.fanny1$cluster, dist(df2))
fviz_silhouette(sil_fanny1)
##   cluster size ave.sil.width
## 1       1  625          0.10
## 2       2  625          0.07
## 3       3  966          0.45

Further Statistical Analysis

Stepwise Regression Model

cluster <- clara_list[[1]]$cluster #2 clusters from CLARA above
df_reg <- cbind(super, as.factor(cluster)) #same as df but include response variable
head(df_reg)
##    Education Marital_Status Income Kidhome Teenhome Recency MntWines MntFruits
## 1 Graduation       Divorced  84835       0        0       0      189       104
## 2 Graduation         Single  57091       0        0       0      464         5
## 3 Graduation        Married  67267       0        1       0      134        11
## 4 Graduation       Together  32474       1        1       0       10         0
## 5 Graduation         Single  21474       1        0       0        6        16
## 6        PhD         Single  71691       0        0       0      336       130
##   MntMeatProducts MntFishProducts MntSweetProducts MntGoldProds
## 1             379             111              189          218
## 2              64               7                0           37
## 3              59              15                2           30
## 4               1               0                0            0
## 5              24              11                0           34
## 6             411             240               32           43
##   NumDealsPurchases NumWebPurchases NumCatalogPurchases NumStorePurchases
## 1                 1               4                   4                 6
## 2                 1               7                   3                 7
## 3                 1               3                   2                 5
## 4                 1               1                   0                 2
## 5                 2               3                   1                 2
## 6                 1               4                   7                 5
##   NumWebVisitsMonth Response Complain Age days_joined as.factor(cluster)
## 1                 1        1        0  45   198.70833                  1
## 2                 5        1        0  54   199.70833                  1
## 3                 2        0        0  57   232.70833                  2
## 4                 7        0        0  48    56.66667                  2
## 5                 7        1        0  26   149.70833                  2
## 6                 2        1        0  57   289.70833                  1
colnames(df_reg)[colnames(df_reg) == "as.factor(cluster)"] <- "cluster"
str(df_reg)
## 'data.frame':    2216 obs. of  22 variables:
##  $ Education          : chr  "Graduation" "Graduation" "Graduation" "Graduation" ...
##  $ Marital_Status     : chr  "Divorced" "Single" "Married" "Together" ...
##  $ Income             : int  84835 57091 67267 32474 21474 71691 63564 44931 65324 65324 ...
##  $ Kidhome            : int  0 0 0 1 1 0 0 0 0 0 ...
##  $ Teenhome           : int  0 0 1 1 0 0 0 1 1 1 ...
##  $ Recency            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MntWines           : int  189 464 134 10 6 336 769 78 384 384 ...
##  $ MntFruits          : int  104 5 11 0 16 130 80 0 0 0 ...
##  $ MntMeatProducts    : int  379 64 59 1 24 411 252 11 102 102 ...
##  $ MntFishProducts    : int  111 7 15 0 11 240 15 0 21 21 ...
##  $ MntSweetProducts   : int  189 0 2 0 0 32 34 0 32 32 ...
##  $ MntGoldProds       : int  218 37 30 0 34 43 65 7 5 5 ...
##  $ NumDealsPurchases  : int  1 1 1 1 2 1 1 1 3 3 ...
##  $ NumWebPurchases    : int  4 7 3 1 3 4 10 2 6 6 ...
##  $ NumCatalogPurchases: int  4 3 2 0 1 7 10 1 2 2 ...
##  $ NumStorePurchases  : int  6 7 5 2 2 5 7 3 9 9 ...
##  $ NumWebVisitsMonth  : int  1 5 2 7 7 2 6 5 4 4 ...
##  $ Response           : int  1 1 0 0 1 1 1 0 0 0 ...
##  $ Complain           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age                : num  45 54 57 48 26 57 61 48 61 61 ...
##  $ days_joined        : num  198.7 199.7 232.7 56.7 149.7 ...
##  $ cluster            : Factor w/ 2 levels "1","2": 1 1 2 2 2 1 1 2 1 1 ...
# make this example reproducible
set.seed(1)

#use 70% of dataset as training set and 30% as test set
sample <- sample(c(TRUE, FALSE), nrow(df_reg), replace=TRUE, prob=c(0.7,0.3))
train.data  <- df_reg[sample, ]
test.data   <- df_reg[!sample, ]

str(train.data)
## 'data.frame':    1551 obs. of  22 variables:
##  $ Education          : chr  "Graduation" "Graduation" "Graduation" "Graduation" ...
##  $ Marital_Status     : chr  "Divorced" "Single" "Married" "Single" ...
##  $ Income             : int  84835 57091 67267 21474 44931 65324 65324 81044 62499 67786 ...
##  $ Kidhome            : int  0 0 0 1 0 0 0 0 1 0 ...
##  $ Teenhome           : int  0 0 1 0 1 1 1 0 0 0 ...
##  $ Recency            : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ MntWines           : int  189 464 134 6 78 384 384 450 140 431 ...
##  $ MntFruits          : int  104 5 11 16 0 0 0 26 4 82 ...
##  $ MntMeatProducts    : int  379 64 59 24 11 102 102 535 61 441 ...
##  $ MntFishProducts    : int  111 7 15 11 0 21 21 73 0 80 ...
##  $ MntSweetProducts   : int  189 0 2 0 0 32 32 98 13 20 ...
##  $ MntGoldProds       : int  218 37 30 34 7 5 5 26 4 102 ...
##  $ NumDealsPurchases  : int  1 1 1 2 1 3 3 1 2 1 ...
##  $ NumWebPurchases    : int  4 7 3 3 2 6 6 5 3 3 ...
##  $ NumCatalogPurchases: int  4 3 2 1 1 2 2 6 1 6 ...
##  $ NumStorePurchases  : int  6 7 5 2 3 9 9 10 6 6 ...
##  $ NumWebVisitsMonth  : int  1 5 2 7 5 4 4 1 4 1 ...
##  $ Response           : int  1 1 0 1 0 0 0 0 0 1 ...
##  $ Complain           : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ Age                : num  45 54 57 26 48 61 61 68 36 56 ...
##  $ days_joined        : num  199 200 233 150 348 ...
##  $ cluster            : Factor w/ 2 levels "1","2": 1 1 2 2 2 1 1 1 2 1 ...

Stepwise Logistic Regression with Clustering Label
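The stepAIC() call below repeatedly drops the predictor whose removal yields the lowest AIC, stopping when no removal improves on the current model. A minimal sketch of that backward-elimination loop on a toy linear model (Python, illustrative only; all data and helper names here are made up, and stepAIC additionally handles GLMs, factors, and forward steps):

```python
import math

def ols_rss(X, y):
    """Residual sum of squares of a least-squares fit with intercept,
    solved from the normal equations by Gaussian elimination."""
    n = len(y)
    Xd = [[1.0] + list(row) for row in X]
    p = len(Xd[0])
    A = [[sum(Xd[i][j] * Xd[i][k] for i in range(n)) for k in range(p)]
         for j in range(p)]
    b = [sum(Xd[i][j] * y[i] for i in range(n)) for j in range(p)]
    for col in range(p):                  # forward elimination with pivoting
        piv = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[piv], b[col], b[piv] = A[piv], A[col], b[piv], b[col]
        for r in range(col + 1, p):
            f = A[r][col] / A[col][col]
            for k in range(col, p):
                A[r][k] -= f * A[col][k]
            b[r] -= f * b[col]
    beta = [0.0] * p
    for r in range(p - 1, -1, -1):        # back substitution
        beta[r] = (b[r] - sum(A[r][k] * beta[k]
                              for k in range(r + 1, p))) / A[r][r]
    return sum((y[i] - sum(c * v for c, v in zip(beta, Xd[i]))) ** 2
               for i in range(n))

def aic(X, y, cols):
    """Gaussian AIC: n*log(RSS/n) + 2*(slopes + intercept + error variance)."""
    n = len(y)
    rss = ols_rss([[row[c] for c in cols] for row in X], y)
    return n * math.log(rss / n) + 2 * (len(cols) + 2)

# toy data: y depends on feature 0; feature 1 carries no information about y
X = [(1, 1), (2, 0), (3, 0), (4, 1), (5, 0), (6, 1)]
y = [2.1, 3.9, 6.1, 7.8, 10.0, 12.1]

cols = [0, 1]
while True:                               # backward elimination by AIC
    current = aic(X, y, cols)
    best, drop = min((aic(X, y, [c for c in cols if c != d]), d) for d in cols)
    if best >= current:
        break
    cols.remove(drop)
print(cols)  # → [0]  (the uninformative feature is eliminated)
```

This is the same logic as the trace printed below: each "- term" row is a candidate removal scored by AIC, and the step taken is the one with the smallest AIC.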

# model
model <- glm(Response ~., data = train.data, family = binomial(logit)) %>%
  stepAIC(trace = TRUE)
## Start:  AIC=982.58
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome + 
##     Recency + MntWines + MntFruits + MntMeatProducts + MntFishProducts + 
##     MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases + 
##     NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth + 
##     Complain + Age + days_joined + cluster
## 
##                       Df Deviance     AIC
## - MntFruits            1   920.69  980.69
## - Complain             1   920.79  980.79
## - Age                  1   921.08  981.08
## - MntFishProducts      1   922.22  982.22
## - NumDealsPurchases    1   922.27  982.27
## - Income               1   922.48  982.48
## <none>                     920.58  982.58
## - MntSweetProducts     1   923.93  983.93
## - Kidhome              1   924.05  984.05
## - cluster              1   924.47  984.47
## - MntGoldProds         1   928.26  988.26
## - NumWebVisitsMonth    1   928.78  988.78
## - NumCatalogPurchases  1   929.87  989.87
## - Education            4   936.27  990.27
## - MntWines             1   931.66  991.66
## - NumWebPurchases      1   936.32  996.32
## - MntMeatProducts      1   941.17 1001.17
## - NumStorePurchases    1   947.89 1007.89
## - Teenhome             1   948.52 1008.52
## - days_joined          1   954.32 1014.32
## - Marital_Status       7   971.86 1019.86
## - Recency              1  1033.06 1093.06
## 
## Step:  AIC=980.69
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome + 
##     Recency + MntWines + MntMeatProducts + MntFishProducts + 
##     MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases + 
##     NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth + 
##     Complain + Age + days_joined + cluster
## 
##                       Df Deviance     AIC
## - Complain             1   920.91  978.91
## - Age                  1   921.19  979.19
## - MntFishProducts      1   922.24  980.24
## - NumDealsPurchases    1   922.41  980.41
## - Income               1   922.61  980.61
## <none>                     920.69  980.69
## - Kidhome              1   924.15  982.15
## - MntSweetProducts     1   924.30  982.30
## - cluster              1   924.48  982.48
## - MntGoldProds         1   928.60  986.60
## - NumWebVisitsMonth    1   928.86  986.86
## - NumCatalogPurchases  1   930.01  988.01
## - Education            4   936.27  988.27
## - MntWines             1   931.69  989.69
## - NumWebPurchases      1   936.42  994.42
## - MntMeatProducts      1   942.35 1000.35
## - NumStorePurchases    1   948.00 1006.00
## - Teenhome             1   949.03 1007.03
## - days_joined          1   954.41 1012.41
## - Marital_Status       7   972.00 1018.00
## - Recency              1  1033.14 1091.14
## 
## Step:  AIC=978.91
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome + 
##     Recency + MntWines + MntMeatProducts + MntFishProducts + 
##     MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases + 
##     NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth + 
##     Age + days_joined + cluster
## 
##                       Df Deviance     AIC
## - Age                  1   921.42  977.42
## - MntFishProducts      1   922.41  978.41
## - NumDealsPurchases    1   922.59  978.59
## - Income               1   922.83  978.83
## <none>                     920.91  978.91
## - Kidhome              1   924.28  980.28
## - MntSweetProducts     1   924.49  980.49
## - cluster              1   924.71  980.71
## - MntGoldProds         1   928.85  984.85
## - NumWebVisitsMonth    1   929.03  985.03
## - NumCatalogPurchases  1   930.12  986.12
## - Education            4   936.57  986.57
## - MntWines             1   932.04  988.04
## - NumWebPurchases      1   936.57  992.57
## - MntMeatProducts      1   942.51  998.51
## - NumStorePurchases    1   948.48 1004.48
## - Teenhome             1   949.30 1005.30
## - days_joined          1   954.51 1010.51
## - Marital_Status       7   972.03 1016.03
## - Recency              1  1033.41 1089.41
## 
## Step:  AIC=977.42
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome + 
##     Recency + MntWines + MntMeatProducts + MntFishProducts + 
##     MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases + 
##     NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth + 
##     days_joined + cluster
## 
##                       Df Deviance     AIC
## - MntFishProducts      1   922.86  976.86
## - NumDealsPurchases    1   923.09  977.09
## - Income               1   923.38  977.38
## <none>                     921.42  977.42
## - Kidhome              1   924.56  978.56
## - MntSweetProducts     1   924.95  978.95
## - cluster              1   925.06  979.06
## - MntGoldProds         1   929.31  983.31
## - NumWebVisitsMonth    1   929.44  983.44
## - NumCatalogPurchases  1   930.75  984.75
## - Education            4   937.99  985.99
## - MntWines             1   932.55  986.55
## - NumWebPurchases      1   937.14  991.14
## - MntMeatProducts      1   942.70  996.70
## - NumStorePurchases    1   949.45 1003.45
## - Teenhome             1   949.77 1003.77
## - days_joined          1   954.69 1008.69
## - Marital_Status       7   973.03 1015.03
## - Recency              1  1033.66 1087.66
## 
## Step:  AIC=976.86
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome + 
##     Recency + MntWines + MntMeatProducts + MntSweetProducts + 
##     MntGoldProds + NumDealsPurchases + NumWebPurchases + NumCatalogPurchases + 
##     NumStorePurchases + NumWebVisitsMonth + days_joined + cluster
## 
##                       Df Deviance     AIC
## - NumDealsPurchases    1   924.37  976.37
## <none>                     922.86  976.86
## - Income               1   924.91  976.91
## - MntSweetProducts     1   925.58  977.58
## - Kidhome              1   925.95  977.95
## - cluster              1   927.44  979.44
## - MntGoldProds         1   929.84  981.84
## - NumWebVisitsMonth    1   931.50  983.50
## - NumCatalogPurchases  1   931.93  983.93
## - Education            4   940.26  986.26
## - MntWines             1   935.00  987.00
## - NumWebPurchases      1   938.48  990.48
## - MntMeatProducts      1   943.05  995.05
## - NumStorePurchases    1   950.88 1002.88
## - Teenhome             1   950.99 1002.99
## - days_joined          1   955.37 1007.37
## - Marital_Status       7   974.49 1014.49
## - Recency              1  1034.44 1086.44
## 
## Step:  AIC=976.37
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome + 
##     Recency + MntWines + MntMeatProducts + MntSweetProducts + 
##     MntGoldProds + NumWebPurchases + NumCatalogPurchases + NumStorePurchases + 
##     NumWebVisitsMonth + days_joined + cluster
## 
##                       Df Deviance     AIC
## <none>                     924.37  976.37
## - Kidhome              1   926.44  976.44
## - Income               1   926.51  976.51
## - MntSweetProducts     1   927.16  977.16
## - cluster              1   928.98  978.98
## - MntGoldProds         1   930.87  980.87
## - NumWebVisitsMonth    1   931.81  981.81
## - NumCatalogPurchases  1   932.67  982.67
## - Education            4   941.68  985.68
## - MntWines             1   936.66  986.66
## - NumWebPurchases      1   938.83  988.83
## - MntMeatProducts      1   943.88  993.88
## - NumStorePurchases    1   954.26 1004.26
## - days_joined          1   955.75 1005.75
## - Marital_Status       7   975.52 1013.52
## - Teenhome             1   964.16 1014.16
## - Recency              1  1034.60 1084.60
summary(model)
## 
## Call:
## glm(formula = Response ~ Education + Marital_Status + Income + 
##     Kidhome + Teenhome + Recency + MntWines + MntMeatProducts + 
##     MntSweetProducts + MntGoldProds + NumWebPurchases + NumCatalogPurchases + 
##     NumStorePurchases + NumWebVisitsMonth + days_joined + cluster, 
##     family = binomial(logit), data = train.data)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.4435  -0.4787  -0.2679  -0.1174   2.8974  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -2.393e+00  1.652e+00  -1.449 0.147436    
## EducationBasic         -1.100e+00  8.170e-01  -1.347 0.178132    
## EducationGraduation     9.478e-03  3.163e-01   0.030 0.976098    
## EducationMaster         2.977e-01  3.632e-01   0.820 0.412424    
## EducationPhD            8.041e-01  3.458e-01   2.325 0.020063 *  
## Marital_StatusAlone    -1.469e+01  6.044e+02  -0.024 0.980604    
## Marital_StatusDivorced -1.146e+00  1.559e+00  -0.735 0.462256    
## Marital_StatusMarried  -2.264e+00  1.551e+00  -1.460 0.144348    
## Marital_StatusSingle   -1.177e+00  1.554e+00  -0.758 0.448620    
## Marital_StatusTogether -2.491e+00  1.559e+00  -1.598 0.110005    
## Marital_StatusWidow    -1.208e+00  1.587e+00  -0.761 0.446649    
## Marital_StatusYOLO      1.267e+01  8.827e+02   0.014 0.988548    
## Income                  4.814e-06  3.170e-06   1.519 0.128846    
## Kidhome                 3.235e-01  2.247e-01   1.440 0.149848    
## Teenhome               -1.200e+00  2.004e-01  -5.987 2.14e-09 ***
## Recency                -3.203e-02  3.334e-03  -9.605  < 2e-16 ***
## MntWines                1.312e-03  3.766e-04   3.484 0.000495 ***
## MntMeatProducts         2.345e-03  5.379e-04   4.360 1.30e-05 ***
## MntSweetProducts        4.028e-03  2.398e-03   1.679 0.093072 .  
## MntGoldProds            4.531e-03  1.763e-03   2.570 0.010176 *  
## NumWebPurchases         1.594e-01  4.254e-02   3.748 0.000178 ***
## NumCatalogPurchases     1.331e-01  4.693e-02   2.836 0.004571 ** 
## NumStorePurchases      -2.121e-01  4.018e-02  -5.278 1.31e-07 ***
## NumWebVisitsMonth       1.471e-01  5.253e-02   2.800 0.005110 ** 
## days_joined             2.263e-03  4.146e-04   5.458 4.82e-08 ***
## cluster2                8.309e-01  3.886e-01   2.138 0.032504 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1336.47  on 1550  degrees of freedom
## Residual deviance:  924.37  on 1525  degrees of freedom
## AIC: 976.37
## 
## Number of Fisher Scoring iterations: 13
# Make predictions
probabilities <- model %>% predict(test.data, type = "response")
predicted.classes <- ifelse(probabilities > 0.5, 1, 0)
# Model accuracy
mean(predicted.classes==test.data$Response)
## [1] 0.8586466
confusion_matrix <- table(test.data$Response, predicted.classes)
confusion_matrix
##    predicted.classes
##       0   1
##   0 536  36
##   1  58  35
confusionMatrix(confusion_matrix, positive = "1")
## Confusion Matrix and Statistics
## 
##    predicted.classes
##       0   1
##   0 536  36
##   1  58  35
##                                           
##                Accuracy : 0.8586          
##                  95% CI : (0.8298, 0.8842)
##     No Information Rate : 0.8932          
##     P-Value [Acc > NIR] : 0.99775         
##                                           
##                   Kappa : 0.3479          
##                                           
##  Mcnemar's Test P-Value : 0.03031         
##                                           
##             Sensitivity : 0.49296         
##             Specificity : 0.90236         
##          Pos Pred Value : 0.37634         
##          Neg Pred Value : 0.93706         
##              Prevalence : 0.10677         
##          Detection Rate : 0.05263         
##    Detection Prevalence : 0.13985         
##       Balanced Accuracy : 0.69766         
##                                           
##        'Positive' Class : 1               
## 
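
As a sanity check, the headline rates in caret's summary can be recomputed by hand from the four cell counts printed above. One caution (our observation, not from the original analysis): `confusionMatrix()` reads a table with predictions in the rows and reference classes in the columns, whereas the table here was built as `table(actual, predicted)`; under caret's reading, the printed Sensitivity and Pos Pred Value are therefore swapped relative to the conventional truth-in-rows interpretation. The sketch below reproduces the numbers exactly as caret reports them:

```r
# Cell counts copied from the printed table, read the way caret reads it
# (columns = reference, positive class = "1")
tp <- 35; fn <- 36; fp <- 58; tn <- 536

accuracy    <- (tp + tn) / (tp + fn + fp + tn)  # ~0.8586
sensitivity <- tp / (tp + fn)                   # ~0.4930
specificity <- tn / (tn + fp)                   # ~0.9024
ppv         <- tp / (tp + fp)                   # ~0.3763
round(c(accuracy = accuracy, sensitivity = sensitivity,
        specificity = specificity, ppv = ppv), 4)
```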

Stepwise Logistic Regression without Clustering Label

We want to compare the accuracy of the regression model with and without the clustering label, so we fit the same stepwise logistic regression to the dataset, this time excluding the clustering label.

# model
model2 <- glm(Response ~., data = train.data[-22], family = binomial(logit)) %>%
  stepAIC(trace = TRUE)
## Start:  AIC=984.47
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome + 
##     Recency + MntWines + MntFruits + MntMeatProducts + MntFishProducts + 
##     MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases + 
##     NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth + 
##     Complain + Age + days_joined
## 
##                       Df Deviance     AIC
## - MntFruits            1   924.48  982.48
## - Complain             1   924.70  982.70
## - Age                  1   924.80  982.80
## - Income               1   925.50  983.50
## - NumDealsPurchases    1   926.26  984.26
## <none>                     924.47  984.47
## - MntFishProducts      1   926.98  984.98
## - MntSweetProducts     1   927.12  985.12
## - Kidhome              1   928.72  986.72
## - NumCatalogPurchases  1   931.57  989.57
## - MntGoldProds         1   932.24  990.24
## - MntWines             1   932.81  990.81
## - Education            4   940.02  992.02
## - NumWebVisitsMonth    1   934.99  992.99
## - NumWebPurchases      1   937.74  995.74
## - MntMeatProducts      1   943.60 1001.60
## - Teenhome             1   951.42 1009.42
## - days_joined          1   958.64 1016.64
## - NumStorePurchases    1   960.78 1018.78
## - Marital_Status       7   975.34 1021.34
## - Recency              1  1035.75 1093.75
## 
## Step:  AIC=982.48
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome + 
##     Recency + MntWines + MntMeatProducts + MntFishProducts + 
##     MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases + 
##     NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth + 
##     Complain + Age + days_joined
## 
##                       Df Deviance     AIC
## - Complain             1   924.71  980.71
## - Age                  1   924.82  980.82
## - Income               1   925.52  981.52
## - NumDealsPurchases    1   926.29  982.29
## <none>                     924.48  982.48
## - MntFishProducts      1   927.00  983.00
## - MntSweetProducts     1   927.29  983.29
## - Kidhome              1   928.73  984.73
## - NumCatalogPurchases  1   931.62  987.62
## - MntGoldProds         1   932.38  988.38
## - MntWines             1   932.81  988.81
## - Education            4   940.05  990.05
## - NumWebVisitsMonth    1   934.99  990.99
## - NumWebPurchases      1   937.77  993.77
## - MntMeatProducts      1   944.42 1000.42
## - Teenhome             1   951.69 1007.69
## - days_joined          1   958.65 1014.65
## - NumStorePurchases    1   960.78 1016.78
## - Marital_Status       7   975.38 1019.38
## - Recency              1  1035.78 1091.78
## 
## Step:  AIC=980.71
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome + 
##     Recency + MntWines + MntMeatProducts + MntFishProducts + 
##     MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases + 
##     NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth + 
##     Age + days_joined
## 
##                       Df Deviance     AIC
## - Age                  1   925.06  979.06
## - Income               1   925.76  979.76
## - NumDealsPurchases    1   926.48  980.48
## <none>                     924.71  980.71
## - MntFishProducts      1   927.18  981.18
## - MntSweetProducts     1   927.49  981.49
## - Kidhome              1   928.85  982.85
## - NumCatalogPurchases  1   931.75  985.75
## - MntGoldProds         1   932.64  986.64
## - MntWines             1   933.16  987.16
## - Education            4   940.35  988.35
## - NumWebVisitsMonth    1   935.17  989.17
## - NumWebPurchases      1   937.93  991.93
## - MntMeatProducts      1   944.59  998.59
## - Teenhome             1   951.97 1005.97
## - days_joined          1   958.75 1012.75
## - NumStorePurchases    1   961.35 1015.35
## - Marital_Status       7   975.41 1017.41
## - Recency              1  1036.06 1090.06
## 
## Step:  AIC=979.06
## Response ~ Education + Marital_Status + Income + Kidhome + Teenhome + 
##     Recency + MntWines + MntMeatProducts + MntFishProducts + 
##     MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases + 
##     NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth + 
##     days_joined
## 
##                       Df Deviance     AIC
## - Income               1   926.14  978.14
## - NumDealsPurchases    1   926.81  978.81
## <none>                     925.06  979.06
## - MntFishProducts      1   927.44  979.44
## - MntSweetProducts     1   927.81  979.81
## - Kidhome              1   928.99  980.99
## - NumCatalogPurchases  1   932.25  984.25
## - MntGoldProds         1   932.95  984.95
## - MntWines             1   933.58  985.58
## - NumWebVisitsMonth    1   935.37  987.37
## - Education            4   941.50  987.50
## - NumWebPurchases      1   938.39  990.39
## - MntMeatProducts      1   944.71  996.71
## - Teenhome             1   952.55 1004.55
## - days_joined          1   958.84 1010.84
## - NumStorePurchases    1   961.99 1013.99
## - Marital_Status       7   976.14 1016.14
## - Recency              1  1036.22 1088.22
## 
## Step:  AIC=978.14
## Response ~ Education + Marital_Status + Kidhome + Teenhome + 
##     Recency + MntWines + MntMeatProducts + MntFishProducts + 
##     MntSweetProducts + MntGoldProds + NumDealsPurchases + NumWebPurchases + 
##     NumCatalogPurchases + NumStorePurchases + NumWebVisitsMonth + 
##     days_joined
## 
##                       Df Deviance     AIC
## - NumDealsPurchases    1   927.94  977.94
## <none>                     926.14  978.14
## - MntFishProducts      1   928.51  978.51
## - MntSweetProducts     1   929.04  979.04
## - Kidhome              1   930.29  980.29
## - NumCatalogPurchases  1   933.54  983.54
## - MntGoldProds         1   933.96  983.96
## - NumWebVisitsMonth    1   935.51  985.51
## - MntWines             1   935.51  985.51
## - Education            4   942.86  986.86
## - NumWebPurchases      1   940.22  990.22
## - MntMeatProducts      1   946.52  996.52
## - Teenhome             1   953.03 1003.03
## - days_joined          1   959.98 1009.98
## - NumStorePurchases    1   962.48 1012.48
## - Marital_Status       7   976.90 1014.90
## - Recency              1  1037.82 1087.82
## 
## Step:  AIC=977.94
## Response ~ Education + Marital_Status + Kidhome + Teenhome + 
##     Recency + MntWines + MntMeatProducts + MntFishProducts + 
##     MntSweetProducts + MntGoldProds + NumWebPurchases + NumCatalogPurchases + 
##     NumStorePurchases + NumWebVisitsMonth + days_joined
## 
##                       Df Deviance     AIC
## <none>                     927.94  977.94
## - MntFishProducts      1   930.11  978.11
## - Kidhome              1   930.80  978.80
## - MntSweetProducts     1   930.84  978.84
## - NumCatalogPurchases  1   934.63  982.63
## - MntGoldProds         1   935.18  983.18
## - NumWebVisitsMonth    1   935.89  983.89
## - MntWines             1   937.48  985.48
## - Education            4   944.60  986.60
## - NumWebPurchases      1   940.71  988.71
## - MntMeatProducts      1   947.58  995.58
## - days_joined          1   960.43 1008.43
## - Marital_Status       7   978.18 1014.18
## - Teenhome             1   966.48 1014.48
## - NumStorePurchases    1   966.74 1014.74
## - Recency              1  1038.03 1086.03
summary(model2)
## 
## Call:
## glm(formula = Response ~ Education + Marital_Status + Kidhome + 
##     Teenhome + Recency + MntWines + MntMeatProducts + MntFishProducts + 
##     MntSweetProducts + MntGoldProds + NumWebPurchases + NumCatalogPurchases + 
##     NumStorePurchases + NumWebVisitsMonth + days_joined, family = binomial(logit), 
##     data = train.data[-22])
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -3.3784  -0.4862  -0.2731  -0.1159   2.7199  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            -1.153e+00  1.613e+00  -0.715 0.474656    
## EducationBasic         -1.218e+00  8.172e-01  -1.491 0.136043    
## EducationGraduation    -1.992e-02  3.141e-01  -0.063 0.949424    
## EducationMaster         2.680e-01  3.624e-01   0.740 0.459603    
## EducationPhD            7.433e-01  3.445e-01   2.157 0.030984 *  
## Marital_StatusAlone    -1.458e+01  6.109e+02  -0.024 0.980963    
## Marital_StatusDivorced -1.332e+00  1.549e+00  -0.860 0.389728    
## Marital_StatusMarried  -2.432e+00  1.541e+00  -1.578 0.114496    
## Marital_StatusSingle   -1.333e+00  1.543e+00  -0.864 0.387848    
## Marital_StatusTogether -2.627e+00  1.548e+00  -1.697 0.089683 .  
## Marital_StatusWidow    -1.423e+00  1.579e+00  -0.901 0.367658    
## Marital_StatusYOLO      1.281e+01  8.827e+02   0.015 0.988421    
## Kidhome                 3.792e-01  2.239e-01   1.694 0.090299 .  
## Teenhome               -1.181e+00  2.003e-01  -5.893 3.80e-09 ***
## Recency                -3.193e-02  3.324e-03  -9.604  < 2e-16 ***
## MntWines                1.104e-03  3.570e-04   3.092 0.001986 ** 
## MntMeatProducts         2.360e-03  5.384e-04   4.383 1.17e-05 ***
## MntFishProducts        -2.906e-03  1.995e-03  -1.457 0.145150    
## MntSweetProducts        4.172e-03  2.439e-03   1.710 0.087224 .  
## MntGoldProds            4.865e-03  1.796e-03   2.709 0.006747 ** 
## NumWebPurchases         1.454e-01  4.115e-02   3.535 0.000408 ***
## NumCatalogPurchases     1.174e-01  4.566e-02   2.570 0.010165 *  
## NumStorePurchases      -2.300e-01  3.889e-02  -5.914 3.33e-09 ***
## NumWebVisitsMonth       1.451e-01  5.023e-02   2.889 0.003865 ** 
## days_joined             2.294e-03  4.134e-04   5.549 2.88e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1336.47  on 1550  degrees of freedom
## Residual deviance:  927.94  on 1526  degrees of freedom
## AIC: 977.94
## 
## Number of Fisher Scoring iterations: 13
# Make predictions
probabilities2 <- model2 %>% predict(test.data[-22], type = "response")
predicted.classes2 <- ifelse(probabilities2 > 0.5, 1, 0)
# Model accuracy
mean(predicted.classes2==test.data$Response)
## [1] 0.8661654
confusion_matrix2 <- table(test.data$Response, predicted.classes2)
confusion_matrix2
##    predicted.classes2
##       0   1
##   0 540  32
##   1  57  36
confusionMatrix(confusion_matrix2, positive = "1")
## Confusion Matrix and Statistics
## 
##    predicted.classes2
##       0   1
##   0 540  32
##   1  57  36
##                                           
##                Accuracy : 0.8662          
##                  95% CI : (0.8379, 0.8911)
##     No Information Rate : 0.8977          
##     P-Value [Acc > NIR] : 0.99603         
##                                           
##                   Kappa : 0.3732          
##                                           
##  Mcnemar's Test P-Value : 0.01096         
##                                           
##             Sensitivity : 0.52941         
##             Specificity : 0.90452         
##          Pos Pred Value : 0.38710         
##          Neg Pred Value : 0.94406         
##              Prevalence : 0.10226         
##          Detection Rate : 0.05414         
##    Detection Prevalence : 0.13985         
##       Balanced Accuracy : 0.71697         
##                                           
##        'Positive' Class : 1               
## 
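
To summarize the comparison, the key figures from the two fits can be collected side by side (all values copied from the outputs above):

```r
# Head-to-head summary of the stepwise models, with and without the
# CLARA cluster label (values copied from the model outputs above)
comparison <- data.frame(
  model       = c("with cluster", "without cluster"),
  AIC         = c(976.37, 977.94),
  accuracy    = c(0.8586, 0.8662),
  kappa       = c(0.3479, 0.3732),
  sensitivity = c(0.4930, 0.5294)
)
comparison
```

On this particular train/test split, keeping the cluster label buys a marginally lower AIC on the training data, but the model without it is slightly more accurate on the holdout set.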

Distribution of Clusters in Categorical Variables

### cross-tabulation between Response and CLARA clusters
tab_response <- table(super$Response, cluster)

# count
tab_response
##    cluster
##        1    2
##   0  764 1119
##   1  205  128
# percentage
prop.table(tab_response, margin = 1)*100
##    cluster
##            1        2
##   0 40.57355 59.42645
##   1 61.56156 38.43844
# perform chi-square test
chisq.test(tab_response)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  tab_response
## X-squared = 49.805, df = 1, p-value = 1.698e-12
# create the side-by-side bar graph
barplot(tab_response, beside = TRUE, xlab = "Cluster", ylab = "Frequency",
        col = c("#D3D3D3", "#3F7ED5"), legend.text = rownames(tab_response),
        args.legend = list(title = "Response", x = "top"))
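
The dependence that the chi-square test detects can also be summarized with a simple odds ratio computed from the counts above (a supplementary check, not part of the original analysis):

```r
# Response x cluster counts copied from the table above
tab <- matrix(c(764, 1119,
                205,  128), nrow = 2, byrow = TRUE,
              dimnames = list(Response = c("0", "1"), cluster = c("1", "2")))
# Odds of being in cluster 1 for responders vs non-responders
odds_ratio <- (tab["1", "1"] / tab["1", "2"]) / (tab["0", "1"] / tab["0", "2"])
odds_ratio  # ~2.35: responders are far more likely to fall in cluster 1
```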

Marital_Status

Marital status has several levels (Married, Together, Single, Divorced, and so on), so we collapse it into two groups: Married / Together in one group and all other statuses in the other.

### cross-tabulation between Marital_Status and CLARA clusters
tab_ms <- table(super$Marital_Status, cluster)

# count
tab_ms
##           cluster
##              1   2
##   Absurd     2   0
##   Alone      0   3
##   Divorced 106 126
##   Married  367 490
##   Single   199 272
##   Together 251 322
##   Widow     44  32
##   YOLO       0   2
# percentage
prop.table(tab_ms, margin = 1)*100
##           cluster
##                    1         2
##   Absurd   100.00000   0.00000
##   Alone      0.00000 100.00000
##   Divorced  45.68966  54.31034
##   Married   42.82380  57.17620
##   Single    42.25053  57.74947
##   Together  43.80454  56.19546
##   Widow     57.89474  42.10526
##   YOLO       0.00000 100.00000
### group the data
tab_ms <- table(super$Marital_Status, cluster)
new_row_married <- colSums(tab_ms[c(4,6),])
new_row_others <- colSums(tab_ms[c(1:3,5,7,8),])
new_table_ms <- rbind(new_row_married, new_row_others)
rownames(new_table_ms)[1] <- "Married / Together"
rownames(new_table_ms)[2] <- "Others"

# count
new_table_ms
##                      1   2
## Married / Together 618 812
## Others             351 435
# percentage
prop.table(new_table_ms, margin = 1)*100
##                           1        2
## Married / Together 43.21678 56.78322
## Others             44.65649 55.34351

From the chi-square test and the side-by-side bar graph, marital status (Married / Together vs. Others) appears to be independent of the cluster assignment: the p-value is 0.54, so we fail to reject the null hypothesis of independence.

# perform chi-square test
chisq.test(new_table_ms)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  new_table_ms
## X-squared = 0.37075, df = 1, p-value = 0.5426
# create the side-by-side bar graph
barplot(new_table_ms, beside = TRUE, xlab = "Cluster", ylab = "Frequency",
        col = c("#D3D3D3", "#3F7ED5"), legend.text = rownames(new_table_ms),
        args.legend = list(title = "Marital Status", x = "topleft"))
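
To make explicit what `chisq.test()` reports here, the Yates-corrected statistic can be reproduced by hand from the grouped counts (a sketch; counts copied from the table above):

```r
# Grouped marital-status table copied from the output above
obs <- matrix(c(618, 812,
                351, 435), nrow = 2, byrow = TRUE)
# Expected counts under independence: (row total x column total) / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)
# Yates' continuity correction subtracts 0.5 from each |O - E|
x2 <- sum((abs(obs - expected) - 0.5)^2 / expected)
p  <- pchisq(x2, df = 1, lower.tail = FALSE)
c(X_squared = x2, p_value = p)  # ~0.371 and ~0.54, matching chisq.test()
```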

Kidhome

For the number of kids at home, we likewise collapse the variable into two groups: no kids and with kids. The marital-status percentages showed little difference between the two clusters; what interests us here is whether the presence of kids at home behaves differently.

### cross-tabulation between Kidhome and CLARA clusters
tab_kid <- table(super$Kidhome, cluster)

# count
tab_kid
##    cluster
##       1   2
##   0 885 398
##   1  82 805
##   2   2  44
# percentage
prop.table(tab_kid, margin = 1)*100
##    cluster
##             1         2
##   0 68.978956 31.021044
##   1  9.244645 90.755355
##   2  4.347826 95.652174
### group the data
tab_kid <- table(super$Kidhome, cluster)
new_row <- colSums(tab_kid[2:3,])
new_table_kid <- rbind(tab_kid[1,], new_row)
rownames(new_table_kid)[1] <- "No Kids at home"
rownames(new_table_kid)[2] <- "With Kids at home"

# count
new_table_kid
##                     1   2
## No Kids at home   885 398
## With Kids at home  84 849
# percentage
prop.table(new_table_kid, margin = 1)*100
##                           1        2
## No Kids at home   68.978956 31.02104
## With Kids at home  9.003215 90.99678

In contrast to marital status, the side-by-side bar graph suggests that having kids at home is associated with the cluster assignment, and the chi-square test confirms this: the p-value is far below 0.05.

# perform chi-square test
chisq.test(new_table_kid)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  new_table_kid
## X-squared = 787.22, df = 1, p-value < 2.2e-16
# create the side-by-side bar graph
barplot(new_table_kid, beside = TRUE, xlab = "Cluster", ylab = "Frequency",
        col = c("#D3D3D3", "#3F7ED5"), legend.text = rownames(new_table_kid),
        args.legend = list(title = "Kids Status", x = "top"))
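
Because the chi-square statistic grows with sample size, an effect-size measure helps put the tests so far on a common scale. Cramér's V (an addition to the original analysis, not reported above) reduces to sqrt(X² / n) for a 2×2 table:

```r
# Cramer's V for an r x c table: sqrt(X^2 / (n * (min(r, c) - 1)))
# X^2 values copied from the chisq.test() outputs above; n = 2216 customers
cramers_v <- function(x2, n, min_dim = 2) sqrt(x2 / (n * (min_dim - 1)))
cramers_v(787.22, 2216)    # Kidhome:        ~0.60, a strong association
cramers_v(0.37075, 2216)   # Marital status: ~0.01, negligible
```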

Teenhome

### cross-tabulation between Teenhome and CLARA clusters
tab_teen <- table(super$Teenhome, cluster)

# count
tab_teen
##    cluster
##       1   2
##   0 541 606
##   1 404 614
##   2  24  27
# percentage
prop.table(tab_teen, margin = 1)*100
##    cluster
##            1        2
##   0 47.16652 52.83348
##   1 39.68566 60.31434
##   2 47.05882 52.94118
### group the data
tab_teen <- table(super$Teenhome, cluster)
new_row <- colSums(tab_teen[2:3,])
new_table_teen <- rbind(tab_teen[1,], new_row)
rownames(new_table_teen)[1] <- "No Teens at home"
rownames(new_table_teen)[2] <- "With Teens at home"

# count
new_table_teen
##                      1   2
## No Teens at home   541 606
## With Teens at home 428 641
# percentage
prop.table(new_table_teen, margin = 1)*100
##                           1        2
## No Teens at home   47.16652 52.83348
## With Teens at home 40.03742 59.96258
# perform chi-square test
chisq.test(new_table_teen)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  new_table_teen
## X-squared = 11.141, df = 1, p-value = 0.0008446
# create the side-by-side bar graph
barplot(new_table_teen, beside = TRUE, xlab = "Cluster", ylab = "Frequency",
        col = c("#D3D3D3", "#3F7ED5"), legend.text = rownames(new_table_teen),
        args.legend = list(title = "Teens Status", x = "top"))
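
The same 2×2 test can be phrased as a two-sample proportion test: comparing the share of cluster-1 customers between households with and without teens (counts from the grouped table above) reproduces the chi-square statistic, since `prop.test()` applies the same Yates-corrected test:

```r
# Cluster-1 counts and row totals copied from the grouped table above
# (no teens: 541 of 1147 in cluster 1; with teens: 428 of 1069)
pt <- prop.test(x = c(541, 428), n = c(541 + 606, 428 + 641))
pt$statistic  # X-squared ~11.141, identical to chisq.test() above
pt$p.value    # ~0.00084
```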

Education

### cross-tabulation between Education and CLARA clusters
tab_education <- table(super$Education, cluster)

# count
tab_education
##             cluster
##                1   2
##   2n Cycle    76 124
##   Basic        1  53
##   Graduation 507 609
##   Master     144 221
##   PhD        241 240
# percentage
prop.table(tab_education, margin = 1)*100
##             cluster
##                      1         2
##   2n Cycle   38.000000 62.000000
##   Basic       1.851852 98.148148
##   Graduation 45.430108 54.569892
##   Master     39.452055 60.547945
##   PhD        50.103950 49.896050
### group the data
tab_edu <- table(super$Education, cluster)
new_row <- colSums(tab_edu[c(2,3),])
new_row1 <- colSums(tab_edu[c(1, 4:5),])
new_table_edu <- rbind(new_row, new_row1)
rownames(new_table_edu)[1] <- "Basic Education / Graduation"
rownames(new_table_edu)[2] <- "Higher Education"

# count
new_table_edu
##                                1   2
## Basic Education / Graduation 508 662
## Higher Education             461 585
# percentage
prop.table(new_table_edu, margin = 1)*100
##                                     1        2
## Basic Education / Graduation 43.41880 56.58120
## Higher Education             44.07266 55.92734
# perform chi-square test
chisq.test(new_table_edu)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  new_table_edu
## X-squared = 0.07122, df = 1, p-value = 0.7896
# create the side-by-side bar graph
barplot(new_table_edu, beside = TRUE, xlab = "Cluster", ylab = "Frequency",
        col = c("#D3D3D3", "#3F7ED5"), legend.text = rownames(new_table_edu),
        args.legend = list(title = "Education", x = "topleft"))

Conclusion

  • The clustering tendency of the dataset was verified with the Hopkins statistic and with visualization techniques, and compared against a randomly generated dataset.

  • Hierarchical clustering may not be appropriate for this dataset, even though it was the algorithm recommended as best.

  • Two clusters suit the dataset best; CLARA, hierarchical k-means, and fuzzy k-means clustering are the preferred methods.

  • Clustering algorithms can reveal patterns and structure: grouping similar data points into clusters makes meaningful patterns and relationships in the data easier to identify.

  • To predict the response variable, logistic regression is preferred. Whether logistic regression or a clustering algorithm is the better analysis tool depends on the purpose of the study.

  • The cluster assignment is associated with the number of kids at home.
